The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
The clustering of populations following admixture models is, for now, based on the K-sample test theory (see (Milhaud et al. 2024). Consider \(K\) samples. For \(i=1,...,K\), sample \(X^{(i)} = (X_1^{(i)}, ..., X_{n_i}^{(i)})\) follows \[L_i(x) = p_i F_i(x) + (1-p_i) G_i, \qquad x \in \mathbb{R}.\]
We still use IBM approach to perform pairwise hypothesis testing. The idea is to adapt the K-sample test procedure to obtain a data-driven method that cluster the \(K\) populations into \(N\) subgroups, characterized by a common unknown mixture component. The advantages of such an approach is twofold:
This clustering technique thus allows to cluster unobserved subpopulations instead of individuals. We call this algorithm the K-sample 2-component mixture clustering (K2MC).
We now detail the steps of the algorithm.
If $H_0$ is not rejected then $S_1 = \{x,y\}$,\\
Else $S_1 = \{x\}$, $S_{c+1} = \{y\}$ and then $c=c+1$.
Select $u={\rm argmin}\{d(i,j); i\in S_c, j\in S\setminus \bigcup_{k=1}^c S_k\}$;
Test $H_0$ the simultaneous equality of all the $f_j$, $j\in S_c$ :\\
If $H_0$ not rejected, then put $S_c=S_c\bigcup \{u\}$;\\
Else $S_{c+1} = \{u\}$ and $c = c+1$.
We present a case study with 5 populations to cluster on \(\mathbb{R}^+\), with Gamma-Exponential, Exponential-Exponential and Gamma-Gamma mixtures.
#set.seed(123)
## Simulate mixture data:
mixt1 <- twoComp_mixt(n = 6000, weight = 0.8,
comp.dist = list("gamma", "exp"),
comp.param = list(list("shape" = 16, "scale" = 1/4),
list("rate" = 1/3.5)))
mixt2 <- twoComp_mixt(n = 6000, weight = 0.7,
comp.dist = list("gamma", "exp"),
comp.param = list(list("shape" = 14, "scale" = 1/2),
list("rate" = 1/5)))
mixt3 <- twoComp_mixt(n = 6000, weight = 0.6,
comp.dist = list("gamma", "gamma"),
comp.param = list(list("shape" = 16, "scale" = 1/4),
list("shape" = 12, "scale" = 1/2)))
mixt4 <- twoComp_mixt(n = 6000, weight = 0.5,
comp.dist = list("exp", "exp"),
comp.param = list(list("rate" = 1/2),
list("rate" = 1/7)))
mixt5 <- twoComp_mixt(n = 6000, weight = 0.5,
comp.dist = list("gamma", "exp"),
comp.param = list(list("shape" = 14, "scale" = 1/2),
list("rate" = 1/6)))
data1 <- getmixtData(mixt1)
data2 <- getmixtData(mixt2)
data3 <- getmixtData(mixt3)
data4 <- getmixtData(mixt4)
data5 <- getmixtData(mixt5)
admixMod1 <- admix_model(knownComp_dist = mixt1$comp.dist[[2]],
knownComp_param = mixt1$comp.param[[2]])
admixMod2 <- admix_model(knownComp_dist = mixt2$comp.dist[[2]],
knownComp_param = mixt2$comp.param[[2]])
admixMod3 <- admix_model(knownComp_dist = mixt3$comp.dist[[2]],
knownComp_param = mixt3$comp.param[[2]])
admixMod4 <- admix_model(knownComp_dist = mixt4$comp.dist[[2]],
knownComp_param = mixt4$comp.param[[2]])
admixMod5 <- admix_model(knownComp_dist = mixt5$comp.dist[[2]],
knownComp_param = mixt5$comp.param[[2]])
## Look for the clusters:
admix_cluster(samples = list(data1,data2,data3,data4,data5),
admixMod = list(admixMod1,admixMod2,admixMod3,admixMod4,admixMod5),
conf_level = 0.95, tune_penalty = TRUE, tabul_dist = NULL, echo = FALSE,
n_sim_tab = 50, parallel = FALSE, n_cpu = 2)
#> Call:
#> admix_cluster(samples = list(data1, data2, data3, data4, data5),
#> admixMod = list(admixMod1, admixMod2, admixMod3, admixMod4,
#> admixMod5), conf_level = 0.95, tune_penalty = TRUE, tabul_dist = NULL,
#> echo = FALSE, n_sim_tab = 50, parallel = FALSE, n_cpu = 2)
#>
#> Number of detected clusters across the samples provided: 3.
#>
#> List of samples involved in each built cluster (in numeric format, i.e. inside c()) :
#> - Cluster #1: vector of populations c(1, 3)
#> - Cluster #2: vector of populations 4
#> - Cluster #3: vector of populations c(2, 5)
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.