The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Matching Patients in the cancer Dataset with vecmatch

Practical Example of the Vector Matching Workflow

In this chapter, we will describe an exemplary data analysis of the cancer dataset from the vecmatch package, matching multiple cohorts based on selected covariates. The vecmatch package follows a strict workflow based on the vector matching algorithm defined by Lopez and Gutman (2017), and it is advisable to follow it. The whole process consists of five steps to ensure the best possible matching quality using the vector matching algorithm.

Step 1: Data Exploration and Initial Imbalance Assessment

The vecmatch package provides two graphical functions for assessing initial dataset imbalance: raincloud() and mosaic(). The raincloud() function is designed for continuous variables, while mosaic() is for categorical variables. Both functions allow grouping by up to two categorical variables using the group and facet aesthetics and compute various statistical summaries, including significance tests and effect size coefficients.

In this example analysis, we focus on two predictor variables from the cancer dataset: the discrete sex and the continuous age. To evaluate differences in age across status groups, we can use the raincloud() function as follows:

library(vecmatch)
library(ggplot2)

raincloud(cancer,
  age,
  status,
  significance = "t_test",
  sig_label_color = TRUE,
  sig_label_size = 3,
  limits = c(10, 120)
) +
  scale_y_continuous(breaks = seq(10, 100, 10))

This plot presents the conditional distribution of age across different status groups. The standardized mean differences (SMD) are represented by the brackets on the right side of the jittered point plot. The results of the Student’s t-tests are displayed alongside the boxplots on the right, providing insights into the statistical significance of the differences in age between the groups.

Similarly, we can evaluate the differences in the sex variable using the mosaic() function. The code snippet below demonstrates this:

mosaic(cancer,
  status,
  sex,
  group_counts = TRUE,
  significance = TRUE
)

The plot shows the conditional distribution of sex across different status groups. The counts for all groups are displayed inside the corresponding boxes, representing each subgroup within the mosaic plot. Partial significance is visually coded through color filling aesthetics. The labels provide the results of the Chi-squared test of independence, summarizing the statistical assessment of the relationship between the sex and status variables.

Based on the plots, we can observe some imbalance in the age and sex variables. However, in both cases, the differences between the status groups appear to be rather moderate. All Student’s t-tests, adjusted for multiple comparisons, resulted in statistically insignificant differences in mean age values between the groups. Additionally, 2 out of the 6 calculated standardized mean differences (SMDs) fell below the 0.10 threshold, indicating relatively small differences.

After a closer examination of the mean age values, we can notice that they form two separate clusters. The first cluster includes controls and patients with adenomas, who tend to be younger, while the second cluster consists of patients with benign and malignant colorectal cancer, who are generally older. The primary objective of the matching process will then be to align these two clusters, ensuring they have a common mean age value.

There were no significant differences in the sex distributions across the status groups, as the overall Chi-square test of independence was insignificant. All partial significance tests were statistically insignificant, indicating small values of standardized Pearson’s residuals.

Given the relatively small differences between the status groups, the vector matching algorithm is expected to produce satisfactory results.

Step 2: Estimation of Generalized Propensity Scores

The next step in the vector matching algorithm is the calculation of generalized propensity scores (GPS) for the treatment variable. These scores represent the treatment assignment probabilities and are based on user-defined covariates. The relationship between the treatment variable and predictors can be easily defined using R-specific formula notation, allowing for both additive and interaction effects.

The vecmatch package provides a straightforward method for calculating the GPS-matrix with different methods using the estimate_gps() function. However, custom approaches can also be applied within the vector-matching workflow. The GPS-matrix used in subsequent analyses must have the exact same structure as the output of estimate_gps, containing the treatment variable and the treatment assignment probabilities for each level of the treatment variable. Importantly, the probabilities for each row in the GPS-matrix must sum to 1. The resulting data.frame has to be of class gps.

In the following example, we will use a simple formula with age and sex as predictors, allowing for an interaction effect. The method used to estimate the generalized propensity scores (GPS) will be a multinomial logistic regression, which can handle multiple levels of the treatment variable. The code for estimating the GPS using the multinomial logistic regression model is as follows:

formula_cancer <- status ~ age * sex
gps_matrix <- estimate_gps(formula_cancer,
  cancer,
  method = "multinom",
  reference = "control"
)
head(gps_matrix, 7)
#> gps object (generalized propensity scores)
#> • Number of units: 7
#> • Number of treatments: 4
#> • Treatment column: treatment
#> • GPS probability columns: control, adenoma, crc_benign, crc_malignant
#> • Treatment levels: control, adenoma, crc_benign, crc_malignant
#> • All columns except 'treatment' store probabilities in [0, 1].
#> 
#>   treatment   control   adenoma crc_benign crc_malignant
#> 1   control 0.3050838 0.3254835  0.2074179     0.1620149
#> 2   control 0.2367189 0.3264350  0.2217444     0.2151016
#> 3   control 0.2242229 0.3238480  0.2290992     0.2228299
#> 4   control 0.2796174 0.2875377  0.2373547     0.1954901
#> 5   control 0.3200282 0.3335474  0.1772439     0.1691805
#> 6   control 0.3310855 0.3689677  0.1730250     0.1269219
#> 7   control 0.2647639 0.2668865  0.2534869     0.2148627

Step 3: Calculating Common Support Region Borders

To remove observations that are not eligible for further matching, the borders of the common support region (CSR) must be calculated. The csregion() function facilitates this process by taking the gps_matrix object as its sole argument. It computes the CSR borders and automatically filters the input gps_matrix to include only the observations that fall within the calculated CSR borders.

The function returns an object of class csr_matrix, which contains the filtered data along with several additional attributes. These attributes enable users to manually filter the initial gps_matrix or access a summary of the CSR border calculation as a data.frame object. The csregion() function can be used as follows:

csr_matrix <- csregion(gps_matrix)

We can then compare the dimensions of the original gps_matrix and the filtered csr_matrix.

dim(gps_matrix)
#> [1] 1224    5
dim(csr_matrix)
#> [1] 1204    5

Clearly, 24 observations have been filtered out of the original matrix.

Next, as recommended by Lopez and Gutman (2017), the GPS can be recalculated using only the data within the CSR. However, in our experience, this additional step does not result in a significant improvement in matching quality, especially if the number of observations that fall outside of the CSR is relatively small in comparison to the whole dataset size. Therefore, we decided to bypass this step and proceed with matching directly on the filtered dataset.

Step 4: k-Means Clustering and Matching

The most crucial step of the vector matching algorithm is the k-means clustering and subsequent matching within the resulting clusters. This approach ensures that only observations with similar GPS vectors are matched, thereby justifying the name vector matching. The core components of this step are the stats::kmeans() function ( Core Team 2024), which performs clustering, and the Matching::Matchby() or optmatch::fullmatch() functions Hansen and Klopfer (2006), which match observations within each k-means cluster. These functionalities are combined into a single function named match_gps().

The match_gps() function takes the csr_matrix as input and returns a matched dataset. The matching process can be customized using various arguments, and it is advisable to experiment with different parameter combinations to optimize both matching quality and the number of matched samples (see appendix \(\ref{app:practical}\)). For the replicability of results, it is also crucial to set the random number generator seed before running the clustering and matching procedures. This ensures that the clustering assignments and matches are consistent across different runs of the analysis, allowing others to reproduce the results exactly.

The match_gps() function can be used as follows:

set.seed(164373)
matched_cancer <- match_gps(csr_matrix,
  caliper = 0.21,
  kmeans_cluster = 2,
  reference = "control",
  replace = TRUE,
  method = "fullopt",
  order = "desc"
)
#> Warning: Following specified arguments are not allowed for the method "fullopt"
#> and will be ignored: replace.
#> Warning: The `min_controls` argument for the method "fullopt" was not provided
#> and will default to `0`.
#> Warning: The `max_controls` argument for the method "fullopt" was not provided
#> and will default to `Inf`.

Step 5: Post-Matching Quality Assessment

The vecmatch package provides a dedicated function, balqual(), to assess the quality of post-matching results. This function compares various metrics and statistical summaries between the pre- and post-matching datasets, using a user-defined formula to evaluate balance. Its usage is straightforward and requires only the matched dataset (produced by match_gps()) and the formula used for the calculation of the GPS:

balqual(matched_cancer,
  formula_cancer,
  type = "smd",
  statistic = "max",
  round = 4
)
#> 
#> Matching Quality Evaluation
#> ================================================================================ 
#> 
#> Count table for the treatment variable:
#> -------------------------------------------------- 
#> Treatment                 | Before     | After      
#> -------------------------------------------------- 
#> adenoma                   | 373        | 282        
#> control                   | 313        | 265        
#> crc_benign                | 271        | 223        
#> crc_malignant             | 247        | 200        
#> -------------------------------------------------- 
#> 
#> 
#> Matching summary statistics:
#> ---------------------------------------- 
#> Total n before matching:  1204 
#> Total n after matching:       970 
#> % of matched observations:    80.56 %
#> Total  maximal   SMD value:   0.1887 
#> 
#> 
#> Maximal values :
#> -------------------------------------------------------------------------------- 
#> Variable                  | Coef  | Before       | After        | Quality      
#> -------------------------------------------------------------------------------- 
#> age                       | SMD   | 0.2150       | 0.1887       | Not Balanced 
#> sexF                      | SMD   | 0.1480       | 0.0931       | Balanced     
#> sexM                      | SMD   | 0.1480       | 0.0931       | Balanced     
#> age:sexF                  | SMD   | 0.1548       | 0.0898       | Balanced     
#> age:sexM                  | SMD   | 0.1591       | 0.1037       | Not Balanced 
#> --------------------------------------------------------------------------------

After assessing the quality with balqual(), we can combine both matched and unmatched datasets into a single data frame to visualize age and sex using raincloud() and mosaic():

matched_cancer$dataset <- "matched"
unmatched_cancer <- cancer
unmatched_cancer$dataset <- "unmatched"
data_full <- rbind(matched_cancer, unmatched_cancer)

raincloud(data_full,
  age,
  status,
  dataset,
  significance = "t_test",
  sig_label_color = TRUE,
  sig_label_size = 3,
  limits = c(10, 120)
) +
  scale_y_continuous(breaks = seq(10, 100, 10))

mosaic(data_full,
  status,
  sex,
  dataset,
  group_counts = TRUE,
  significance = TRUE
)

References

Core Team. 2024. : A Language and Environment for Statistical Computing. Vienna, Austria: Foundation for Statistical Computing. https://www.R-project.org/.

Hansen, Ben B, and Stephanie Olsen Klopfer. 2006. “Optimal Full Matching and Related Designs via Network Flows.” Journal of Computational and Graphical Statistics 15 (3): 609–27.

Lopez, Michael J, and Roee Gutman. 2017. “Estimation of Causal Effects with Multiple Treatments: A Review and New Ideas.” Statistical Science, 432–54.

Sekhon, Jasjeet S. 2011. “Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Package for .” Journal of Statistical Software 42: 1–52.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.

Matching Patients in the cancer Dataset with vecmatch