The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

an-application-of-the-OTrecod-package

Package installation

If the package OTrecod is not installed in their current R versions, users can install it by following the standard instruction:

install.packages("OTrecod")

Each time an R session is opened, the OTrecod library must be loaded with:

library(OTrecod)

Moreover, the development version of OTrecod can be installed actually from GitHub with:

# Install development version from GitHub
devtools::install_github("otrecoding/OTrecod")

 

I. The context

This vignette illustrates how to use the tools contained in the OTrecod package to solve a variable recoding problem frequently encountered in the context of data fusion. For more details about the theory of the algorithms used in the functions OT_outcome and OT_joint of the package, user can consult (1),(2) and the documentation linked to each function.

For this example, we have transformed an available database called samp.A from the StatMatch package (see (3) and the help of samp.A for more details). The samp.A database provides a limited number of variables observed at persons levels among those usually collected in the European Union Statistics on Income and Living Conditions Survey (the EU–SILC survey). In this database, the variable c.neti is a factor that corresponds to the persons net income of thousand of Euros categorized in 7 ordered classes. The raw distribution of c.neti is done in the following results:

library(StatMatch)
data(samp.A)
dim(samp.A)
#> [1] 3009   13
head(samp.A)
#>        HH.P.id area5 urb hsize hsize5 age    c.age sex marital edu7 n.income
#> 21384 10149.01    NE   1     1      1  85 (64,104]   2       3    3     1677
#> 35973 17154.02    NE   1     2      2  78 (64,104]   1       2    3    13520
#> 11774  5628.01    NO   2     1      1  48  (44,54]   1       3    3    20000
#> 32127 15319.01     S   1     2      2  78 (64,104]   1       2    1    12428
#> 6301   2973.05     S   2     5    >=5  17  [16,34]   1       1    1        0
#> 12990  6206.02     C   2     2      2  28  [16,34]   2       2    5        0
#>         c.neti        ww
#> 21384   (0,10] 3591.8939
#> 35973  (10,15]  415.1592
#> 11774  (15,20] 2735.4029
#> 32127  (10,15] 1239.5086
#> 6301  (-Inf,0] 5362.7588
#> 12990 (-Inf,0] 2077.7137
table(samp.A$c.neti)   # Repartition of c.neti in the sample
#> 
#>  (-Inf,0]    (0,10]   (10,15]   (15,20]   (20,25]   (25,35] (35, Inf] 
#>       564       671       541       500       307       272       154

To construct a standard recoding problem in data fusion, the samp.A database has been transformed as follows:

The R code related to these transformations was:

c.neti            = as.numeric(samp.A$c.neti)
samp.A$c.neti.bis = as.factor(ifelse(c.neti %in% c(1,2),1, 
                              ifelse(c.neti %in% c(3,4),2, 
                              ifelse(c.neti %in% c(5,6),3,4)))) 
data1 = samp.A[1:200,c(2:3,5,7:9,12:13)]
colnames(data1)[4] = "age" 
head(data1)
#>       area5 urb hsize5      age sex marital   c.neti        ww
#> 21384    NE   1      1 (64,104]   2       3   (0,10] 3591.8939
#> 35973    NE   1      2 (64,104]   1       2  (10,15]  415.1592
#> 11774    NO   2      1  (44,54]   1       3  (15,20] 2735.4029
#> 32127     S   1      2 (64,104]   1       2  (10,15] 1239.5086
#> 6301      S   2    >=5  [16,34]   1       1 (-Inf,0] 5362.7588
#> 12990     C   2      2  [16,34]   2       2 (-Inf,0] 2077.7137
data2 = samp.A[201:350,c(3,5:6,8:11,13:14)]
head(data2)
#>       urb hsize5 age sex marital edu7 n.income       ww c.neti.bis
#> 39565   3      2  81   2       2    1     6448 1129.274          1
#> 36490   1    >=5  48   1       3    2    16423 3082.331          2
#> 27529   2      2  66   1       2    3    15600 2433.020          2
#> 201     2      4  26   1       1    3    12876 1869.286          2
#> 31375   2      2  53   1       2    3    25633 2125.361          3
#> 3226    3      1  45   2       3    3     2611 1855.230          1

In conclusion, after transformation, we had:

table(data1$c.neti)        # 7 levels in data1
#> 
#>  (-Inf,0]    (0,10]   (10,15]   (15,20]   (20,25]   (25,35] (35, Inf] 
#>        37        40        34        49        18        13         9

table(data2$c.neti.bis)    # 4 levels in data2
#> 
#>  1  2  3  4 
#> 60 57 26  7

colnames(data1)
#> [1] "area5"   "urb"     "hsize5"  "age"     "sex"     "marital" "c.neti" 
#> [8] "ww"

colnames(data2)
#> [1] "urb"        "hsize5"     "age"        "sex"        "marital"   
#> [6] "edu7"       "n.income"   "ww"         "c.neti.bis"

intersect(colnames(data1), colnames(data2))   # the susbet of a priori shared variables
#> [1] "urb"     "hsize5"  "age"     "sex"     "marital" "ww"
OBJECTIVE

Assuming that the encoding of c.neti is unknown for the subjects of data2 and that the encoding of c.neti.bis is unknown for the subjects of data1, the functions of the OTrecod package solve this recoding problem by predicting the missing scale of the persons net income in one or the two databases. This solution allows the user to fusion his two databases and finally works with a bigger, unique and synthetic dataset.

For the rest of the study, c.neti and c.neti.bis are called the target variables of data1 and data2 respectively. we deliberately limit this example to the prediction of the variable c.neti.bis in data1 but note that the proposed approach would be the same for the prediction of c.neti in data1.

 

II. Harmonization of the data sources

Knowing the objective of the study, we first prepare the 2 databases to data fusion. The two functions dedicated to this data fusion in the OTrecod package expect a specific structure of database as argument. The merge_dbs function assists user in this task by:

For this, we fill in the arguments of the merge_dbs function by declaring as follows, the column indexes of all the ordinal variables in data1 and data2 (including the target variables indexes). Here there is no need to handle missing information because all the shared variables are complete, but note that it exists a specific argument (impute) in the function to take them into account when necessary.

db_test  = merge_dbs(data1, data2, NAME_Y = "c.neti", NAME_Z = "c.neti.bis",
                     ordinal_DB1 = c(2,3,4,7), ordinal_DB2 = c(1:2,6,9))
#> DBS MERGING in progress. Please wait ...
#> DBS MERGING OK
#> -----------------------
#> 
#> SUMMARY OF DBS MERGING:
#> Nb of removed subjects due to NA on targets: 0(0%)
#> Nb of removed covariates due to differences between the 2 bases: 1
#> Nb of remained covariates: 5
#> Imputation on incomplete covariates: NO

summary(db_test)
#>               Length Class      Mode     
#> DB_READY      8      data.frame list     
#> ID1_drop      0      -none-     character
#> ID2_drop      0      -none-     character
#> Y_LEVELS      7      -none-     character
#> Z_LEVELS      4      -none-     character
#> REMOVE1       1      -none-     character
#> REMOVE2       0      -none-     NULL     
#> REMAINING_VAR 5      -none-     character
#> IMPUTE_TYPE   1      -none-     character
#> DB1_raw       8      data.frame list     
#> DB2_raw       9      data.frame list     
#> SEED          1      -none-     numeric

db_test$REMAINING_VAR
#> [1] "hsize5"  "marital" "sex"     "urb"     "ww"

db_test$REMOVE1
#> [1] "age"

db_test$REMOVE2
#> NULL

db_test$ID1_drop; db_test$ID2_drop
#> character(0)
#> character(0)

db_test$DB_READY[c(1:5,201:205),]   # The 5 1st subjects of the two databases 
#>       DB        Y    Z hsize5 marital sex urb        ww
#> 21384  1   (0,10] <NA>      1       3   2   1 3591.8939
#> 35973  1  (10,15] <NA>      2       2   1   1  415.1592
#> 11774  1  (15,20] <NA>      1       3   1   2 2735.4029
#> 32127  1  (10,15] <NA>      2       2   1   1 1239.5086
#> 6301   1 (-Inf,0] <NA>    >=5       1   1   2 5362.7588
#> 39565  2     <NA>    1      2       2   2   3 1129.2739
#> 36490  2     <NA>    2    >=5       3   1   1 3082.3314
#> 27529  2     <NA>    2      2       2   1   2 2433.0201
#> 201    2     <NA>    2      4       1   1   2 1869.2859
#> 31375  2     <NA>    3      2       2   1   2 2125.3614

In output:

The variable Y now corresponds to the target variable c.neti with its specific encoding in data1 and missing values in data2. In the same way, the variable Z corresponds to c.neti.bis. This database have the structure expected as argument of the OT_outcome and OT_joint functions.

 

III. Selection of the matching variables

Among the set of shared variables kept in the output DB.READY database of the merge_dbs, it is important to discard those that never appear not good predictors of the persons net income whatever the considered database. The subset of remaining variables will be the matching variables. This selection of matching variables can be done using the select_pred function of the package.

This function proposes two levels of study to conclude:

At this step, the choice of the recoding algorithm (see (2)) to use a posteriori must be anticipated in the handling of the potential numeric shared variables (In this example, the ww variable is the only one concerned. Indeed, if the OT_outcome function runs whatever the type considered, this is not the case of the actual version of the OT_joint version which not allows the numeric variable. For this vignette we will test the two algorithms, so we decide to transform ww as categorial using the quartiles of its distribution as thresholds of the four classes. This transformation can be dealt by standard R function (like cut()) or directly in the functions using the convert_num and convert_class arguments dedicated to this task.

The selection of the best predictors of the persons net income must be done in the two databases separately but the selec_pred functions allowed overlayed databases as arguments using the following code:

# for data1
test_DB1 = select_pred(db_test$DB_READY,Y = "Y", Z = "Z", ID = 1, OUT = "Y",
                       quanti = 8, nominal = c(1,5:6,7), ordinal = c(2:4),
                       convert_num = 8, convert_class = 4,
                       thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.10,
                       RF = TRUE, RF_SEED = 3017)
#> The select_pred function is running for outcome= Y. Please wait ...
#> Risk of collinearity between predictors detected: Some predictors will be dropped during RF process
#> The process is now successfully completed
#> ---------
#> For comparison with another outcome from two overlayed tables  : 
#> just adapt the OUT option keeping all the others unchanged in the function
#> ---
#> For comparison with another outcome from two unoverlayed tables:
#> just adapt the arguments from Y to convert_class
#> ---------

# for data2
test_DB2 = select_pred(db_test$DB_READY,Y = "Y", Z = "Z", ID = 1, OUT = "Z",
                       quanti = 8, nominal = c(1,5:6,7), ordinal = c(2:4),
                       convert_num = 8, convert_class = 4,
                       thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.10,
                       RF = TRUE, RF_SEED = 3017)
#> The select_pred function is running for outcome= Z. Please wait ...
#> Risk of collinearity between predictors detected: Some predictors will be dropped during RF process
#> The process is now successfully completed
#> ---------
#> For comparison with another outcome from two overlayed tables  : 
#> just adapt the OUT option keeping all the others unchanged in the function
#> ---
#> For comparison with another outcome from two unoverlayed tables:
#> just adapt the arguments from Y to convert_class
#> ---------

As input:

Notice that it is important to keep the same arguments in the two selections (test_DB1 and test_DB2) for an optimal comparability. Here the standard RF process is used, and the ww variable is converted in categorical type before the selection using convert_num.

Let see the result for data1:

summary(test_DB1)
#>               Length Class      Mode     
#> seed          1      -none-     numeric  
#> outc          1      -none-     character
#> thresh        3      -none-     numeric  
#> convert_num   1      -none-     character
#> DB_USED       8      data.frame list     
#> vcrm_OUTC_cat 5      data.frame list     
#> cor_OUTC_num  5      data.frame list     
#> vcrm_X_cat    5      data.frame list     
#> cor_X_num     5      data.frame list     
#> FG_test       3      -none-     numeric  
#> collinear_PB  2      -none-     list     
#> drop_var      1      -none-     character
#> RF_PRED       4      -none-     numeric  
#> RF_best       2      -none-     character

test_DB1$vcrm_OUTC_cat
#>   name1   name2 V_Cramer CorrV_Cramer   N
#> 4     Y     sex   0.4384       0.4036 200
#> 3     Y marital   0.2452       0.1740 200
#> 5     Y     urb   0.1703       0.0000 200
#> 2     Y  hsize5   0.1634       0.0000 200
#> 6     Y      ww   0.1465       0.0000 200

test_DB1$collinear_PB
#> $VCRAM
#>     name1   name2 V_Cramer CorrV_Cramer   N
#> 21 hsize5 marital   0.5038       0.4858 200
#> 
#> $SPEARM
#> [1] name1       name2       RANK_COR    pv_COR_test N          
#> <0 rows> (or 0-length row.names)

# Results from RF
test_DB1$drop_var
#> [1] "marital"

test_DB1$RF_PRED
#>     sex  hsize5     urb      ww 
#> 86.8886  7.9677  5.1437  0.0000

In the set of shared variables, all variables are now categorical (factors or ordered factors). According to the vcrm_OUTC_cat output object, the best predictors of c.neti are: sex, marital, urb, hsize5, and ww in that order. Nevertheless, a risk of collinearity is detected between marital and hsize5. The RF process finally suggests to drop marital and to keep only sex, hsize5 and eventually urb as matching variables.

Let see the result for data2:

summary(test_DB2)
#>               Length Class      Mode     
#> seed          1      -none-     numeric  
#> outc          1      -none-     character
#> thresh        3      -none-     numeric  
#> convert_num   1      -none-     character
#> DB_USED       8      data.frame list     
#> vcrm_OUTC_cat 5      data.frame list     
#> cor_OUTC_num  5      data.frame list     
#> vcrm_X_cat    5      data.frame list     
#> cor_X_num     5      data.frame list     
#> FG_test       3      -none-     numeric  
#> collinear_PB  2      -none-     list     
#> drop_var      1      -none-     character
#> RF_PRED       4      -none-     numeric  
#> RF_best       2      -none-     character

test_DB2$vcrm_OUTC_cat
#>   name1   name2 V_Cramer CorrV_Cramer   N
#> 4     Z     sex   0.4783       0.4583 150
#> 2     Z  hsize5   0.2343       0.1693 150
#> 3     Z marital   0.1828       0.1160 150
#> 5     Z     urb   0.1386       0.0000 150
#> 6     Z      ww   0.0949       0.0000 150

test_DB2$collinear_PB
#> $VCRAM
#>     name1   name2 V_Cramer CorrV_Cramer   N
#> 21 hsize5 marital   0.3555       0.3177 150
#> 
#> $SPEARM
#> [1] name1       name2       RANK_COR    pv_COR_test N          
#> <0 rows> (or 0-length row.names)

# Results from RF
test_DB2$drop_var
#> [1] "marital"

test_DB2$RF_PRED
#>     sex  hsize5      ww     urb 
#> 86.1852 13.1555  0.6593  0.0000

According to the vcrm_OUTC_cat output object, the best predictors of c.neti.bis are: sex, hsize5, marital, urb, and ww in that order. A risk of collinearity is also detected between marital and hsize5. The RF process here suggests to drop marital and to keep only sex, hsize5 as matching variables.

In conclusion:

The matching variable groups can be:

We finally keep the last group and dropped the other ones for the rest of the example.

match_var = db_test$DB_READY[,-c(5,8)]
match_var[c(1:5,201:205),]
#>       DB        Y    Z hsize5 sex urb
#> 21384  1   (0,10] <NA>      1   2   1
#> 35973  1  (10,15] <NA>      2   1   1
#> 11774  1  (15,20] <NA>      1   1   2
#> 32127  1  (10,15] <NA>      2   1   1
#> 6301   1 (-Inf,0] <NA>    >=5   1   2
#> 39565  2     <NA>    1      2   2   3
#> 36490  2     <NA>    2    >=5   1   1
#> 27529  2     <NA>    2      2   1   2
#> 201    2     <NA>    2      4   1   2
#> 31375  2     <NA>    3      2   1   2

This overlayed database is now ready for recoding.

 

IV. Predicting the missing scales in the databases

The OTrecod package proposes actually two algorithms using optimal transportation theory (see (3) for details) to solve the recoding problem previously introduced. Each algorithm is stored in a unique function called OT_outcome and OT_joint. These two functions can predict the missing values of c.neti in data2, the missing values of c.neti.bis in data1 or the both using a same argument called which.DB.

As with the select_pred function, it is possible to transform directly the continuous matching variables if necessary using the convert_num and convert_class arguments.

Let see the R approach for the prediction of c.neti.bis in data1.

 

A) Transporting target variables to predict the missing scales

The algorithm from OT_outcome function solves an optimization problem to transfer the distributions of the target variables (or outcome) from one database to another. Using this result, the initial version of the algorithm (see (2)) transfers the distribution of c.neti.bis in data2 to the distribution of c.neti.bis in data1, and (inversely for c.neti if necessary). The algorithm executes in another distinct step, a nearest neighbor procedure to affect the indidividual predictions of c.neti.bis in data1. This version of the algorithm is actually available by writing sequential as argument method of the OT_outcome function.

The algorithm assumes the strong hypothesis that the variable c.neti.bis has the same distribution in data1 and data2 (and so on for the variable c.neti). If in this example, by construction, this hypothesis is obviously verified, there are also several contexts where this latter no longer holds.

Enrichments have been thus proposed (see (2)) to relax this hypothesis via the maxrelax argument of OT_outcome and directly provides the individual predictions without using the nearest neighbor process. This algorithm is actually available by assigning the method argument to optimal in input of the OT_outcome function. For our example, the corresponding R code for the prediction of c.neti.bis in data1 is:

# sequential algorithm
mod1_seq = OT_outcome(match_var, nominal = c(1,5:6), ordinal = 2:4, dist.choice = "E",
                      maxrelax = 0 , indiv.method = "sequential", which.DB = "A")
#> ---------------------------------------
#> OT PROCEDURE in progress ...
#> ---------------------------------------
#> Type                     = OUTCOME
#> Distance                 = Euclidean
#> Percent closest knn      = 100%
#> Relaxation parameter     = NO
#> Relaxation value         = 0
#> Individual pred process  = Sequential
#> DB imputed               = A
#> ---------------------------------------
summary(mod1_seq)
#>             Length Class      Mode   
#> time_exe      1    difftime   numeric
#> gamma_A      28    -none-     numeric
#> gamma_B      28    -none-     numeric
#> profile       5    data.frame list   
#> res_prox     16    -none-     list   
#> estimatorZA 812    -none-     numeric
#> estimatorYB   0    -none-     NULL   
#> DATA1_OT      8    data.frame list   
#> DATA2_OT      7    data.frame list

# optimal algorithm with no relaxation term
mod2_opt = OT_outcome(match_var, nominal = c(1,5:6), ordinal = 2:4, dist.choice = "E",
                      maxrelax = 0, indiv.method = "optimal", which.DB = "A")
#> ---------------------------------------
#> OT PROCEDURE in progress ...
#> ---------------------------------------
#> Type                     = R-OUTCOME
#> Distance                 = Euclidean
#> Percent closest knn      = 100%
#> Relaxation parameter     = NO
#> Relaxation value         = 0
#> Individual pred process  = Optimal
#> DB imputed               = A
#> ---------------------------------------
head(mod2_opt$profile)
#>        ID sex_2 urb_2 urb_3 hsize5
#> 21384 P_1     1     0     0      1
#> 35973 P_2     0     0     0      2
#> 11774 P_3     0     1     0      1
#> 6301  P_4     0     1     0      5
#> 12990 P_5     1     1     0      2
#> 39835 P_6     1     1     0      4

dim(mod2_opt$profile)
#> [1] 29  5

mod2_opt$gamma_A
#>               1     2          3          4
#> (-Inf,0]  0.185 0.000 0.00000000 0.00000000
#> (0,10]    0.170 0.030 0.00000000 0.00000000
#> (10,15]   0.000 0.105 0.06500000 0.00000000
#> (15,20]   0.000 0.245 0.00000000 0.00000000
#> (20,25]   0.000 0.000 0.04333333 0.04666667
#> (25,35]   0.000 0.000 0.06500000 0.00000000
#> (35, Inf] 0.045 0.000 0.00000000 0.00000000

head(mod2_opt$DATA1_OT)
#>       DB        Y    Z sex_2 urb_2 urb_3 hsize5 OTpred
#> 21384  1   (0,10] <NA>     1     0     0      1      1
#> 35973  1  (10,15] <NA>     0     0     0      2      3
#> 11774  1  (15,20] <NA>     0     1     0      1      2
#> 32127  1  (10,15] <NA>     0     0     0      2      3
#> 6301   1 (-Inf,0] <NA>     0     1     0      5      1
#> 12990  1 (-Inf,0] <NA>     1     1     0      2      1

As input:

In output, these two models provides lists of same structure:

 

B) Transporting target and shared variables to predict the missing scales

The objective of the algorithm integrated in the OT_joint function is the same as that of the OT_outcome function, but unlike OT_outcome transfers only the distributions of the target variables to solve the optimization problem, the OT_joint function now transports the joint distribution of outcomes and covariates: This approach makes it possible to no longer assumes the strong distributional hypothesis mentionned previously.

Nevertheless, as with the OT_outcome function, enrichments have been proposed to relax distributional constraints (see maxrelax argument and (2)) and eventually adds to the algorithm a regularization term to smooth the variations of target variables with respect to matching variables (see lambda.reg arguments).

To guarantee the convergence of the algorithm:

For our example, the corresponding R code for the prediction of c.neti.bis in data1 is:

# Algorithms with no enrichments
mod3_joint = OT_joint(match_var, nominal = c(1,5), ordinal = c(2:4,6), dist.choice = "E", 
                      prox.X = 0.10, which.DB = "A")
#> ---------------------------------------
#> OT JOINT PROCEDURE in progress ...
#> ---------------------------------------
#> Type                  = JOINT
#> Distance              = Euclidean
#> Percent closest       = 100%
#> Relaxation term       = 0
#> Regularization term   = 0
#> Aggregation tol cov   = 0.1
#> DB imputed            = A
#> ---------------------------------------
summary(mod3_joint)
#>             Length Class      Mode   
#> time_exe      1    difftime   numeric
#> gamma_A      28    -none-     numeric
#> gamma_B       0    -none-     NULL   
#> profile       4    data.frame list   
#> res_prox     16    -none-     list   
#> estimatorZA 812    -none-     numeric
#> estimatorYB   0    -none-     NULL   
#> DATA1_OT      7    data.frame list   
#> DATA2_OT      6    data.frame list

For a better understanding, the input arguments and output objects of the OT_joint function have been thought to be similar to those proposed by the OT_outcome function. As previously, the OTpred variable of the DATA1_OT object stores the individual predictions of c.neti.bis in data1.

 

V. Validation of the individual predictions

The verif_OT function gives access to different tools for assessing the reliability of the individual predictions proposed by the OT_outcome and OT_joint functions.

# Validation of the mod1_seq model
verif_out1 = verif_OT(mod1_seq, stab.prob = TRUE, min.neigb = 3)
verif_out1$conf.mat
#>            predZ
#> Y             1   2   3   4 Sum
#>   (-Inf,0]   37   0   0   0  37
#>   (0,10]     34   6   0   0  40
#>   (10,15]     1  21  12   0  34
#>   (15,20]     1  48   0   0  49
#>   (20,25]     1   1   9   7  18
#>   (25,35]     1   1  11   0  13
#>   (35, Inf]   9   0   0   0   9
#>   Sum        84  77  32   7 200

verif_out1$res.prox
#>        N   V_cram rank_cor 
#>  200.000    0.740    0.645

verif_out1$res.stab
#>          N min.N mean    sd
#> 1st DB 112     3 0.97 0.119


# Validation of the mod2_seq model
verif_out2 = verif_OT(mod2_opt, stab.prob = TRUE, min.neigb = 3)
verif_out2$conf.mat
#>            predZ
#> Y             1   2   3   4 Sum
#>   (-Inf,0]   37   0   0   0  37
#>   (0,10]     34   6   0   0  40
#>   (10,15]     0  21  13   0  34
#>   (15,20]     0  49   0   0  49
#>   (20,25]     0   0   9   9  18
#>   (25,35]     0   0  13   0  13
#>   (35, Inf]   9   0   0   0   9
#>   Sum        80  76  35   9 200
rate_good_pred = (37+40+31+45+18+13+9)/200
rate_good_pred
#> [1] 0.965

verif_out2$res.prox
#>        N   V_cram rank_cor 
#>  200.000    0.800    0.688

verif_out2$res.stab
#>          N min.N mean sd
#> 1st DB 112     3    1  0


# Validation of the mod3_opt model
verif_jt   = verif_OT(mod3_joint, stab.prob = TRUE, min.neigb = 3)
verif_jt$conf.mat
#>            predZ
#> Y             1   2   3   4 Sum
#>   (-Inf,0]   33   4   0   0  37
#>   (0,10]     32   8   0   0  40
#>   (10,15]     4  29   1   0  34
#>   (15,20]    11  19  19   0  49
#>   (20,25]     0  10   5   3  18
#>   (25,35]     1   2  10   0  13
#>   (35, Inf]   6   1   0   2   9
#>   Sum        87  73  35   5 200

verif_jt$res.prox
#>        N   V_cram rank_cor 
#>  200.000    0.550    0.613

verif_jt$res.stab
#>          N min.N  mean    sd
#> 1st DB 112     3 0.869 0.175

As input:

As output, we have selected the most interesting one to make the comparison between the three tested models:

CONCLUSION

The two first models are here more adapted when the strong distributional hypothesis previously introduced holds. Nevertheless, the mod3_jt model could be significantly improved by adding appropriate relaxation and regularization terms, so now, R user, test it by yourself !

References

  1. Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Korosok MR, savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics.Volume 16, Issue 1, 20180106, eISSN 1557-4679.

  2. Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association.

  3. D’Orazio, M (2015). Integration and imputation of survey data in R: the StatMatch package. Romanian Statistical Review, vol. 63(2), pages 57-68

  4. Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307 - https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307

 

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.