The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
In this document we introduce the functionality available in
stops
for fitting multidimensional scaling (MDS; Borg &
Groenen 2005) or proximity scaling (PS) models either on their own or by
utilizing the STOPS idea. We start with a short introduction to PS and
the models that are available. We then explain fitting of these models
with the stops
package. Next, we introduce the reader to
STOPS (Rusch, Mair & Hornik, 2020; Rusch, Mair & Hornik, 2023a),
a method to select hyperparameters in flexible mutlidimensional scaling
methods based on structures in the configuration. We show how to fit
those. We also mention P-COPS as a special case and show the connection
to COPS (Rusch, Mair & Hornik, 2021). For illustration throughout
this document we use the smacof::kinshipdelta
data set
(Rosenberg & Kim, 1975) which lists percentages of how often 15
kinship terms were not grouped together by college students.
library(stops)
For proximity scaling (PS) or multidimensional scaling (MDS) the input is typically an \(N\times N\) matrix \(\Delta^*=f(\Delta)\), a matrix of proximities with elements \(\delta^*_{ij}\), that is a function of a matrix of observed non-negative dissimilarities \(\Delta\) with elements \(\delta_{ij}\). \(\Delta^*\) usually is symmetric (but does not need to be). The main diagonal of \(\Delta\) is 0. We call a \(f: \delta_{ij} \mapsto \delta^*_{ij}\) a proximity transformation function. In the MDS literature these \(\delta_{ij}^*\) are often called dhats or disparities. The problem that proximity scaling solves is to locate an \(N \times M\) matrix \(X\) (the configuration) with row vectors \(x_i, i=1,\ldots,N\) in low-dimensional space \((\mathbb{R}^M, M \leq N)\) in such a way that transformations \(g(d_{ij}(X))\) of the fitted distances \(d_{ij}(X)=d(x_i,x_j)\)—i.e., the distance between different \(x_i, x_j\)—approximate the \(\delta^*_{ij}\) as closely as possible. We call a \(g: d_{ij}(X) \mapsto d_{ij}^*(X)\) a distance transformation function. In other words, proximity scaling means finding \(X\) so that \(d^*_{ij}(X)=g(d_{ij}(X))\approx\delta^*_{ij}=f(\delta_{ij})\). There may also be weights transformed weight \(w^*_{ij}=h(w_{ij})\) with \(h: w_{ij} \mapsto w_{ij}^*\) being a weight transformation function.
This approximation \(D^*(X)\) to the matrix \(\Delta^*\) is found by defining a fit criterion (the loss function), \(\sigma_{PS}(X)=L(\Delta^*,D^*(X))\), that is used to measure how closely \(D^*(X)\) approximates \(\Delta^*\).
The loss function used is then minimized to find the vectors \(x_1,\dots,x_N\), i.e., \[\begin{equation} \label{eq:optim} \arg \min_{X}\ \sigma_{PS}(X). \end{equation}\]
The first popular type of PS supported in stops
is based
on the loss function type stress (Kruskall 1964), employing a
quadratic loss function.
A general formulation of a loss function based on a quadratic loss is \[\begin{equation} \label{eq:stress} \sigma_{PS}(X)=\sigma_{stress}(X)=\sum^N_{i=1}\sum^N_{j=1} w^*_{ij}\left[d^*_{ij}(X)-\delta^*_{ij}\right]^2=\sum^N_{i=1}\sum^N_{j=1} h(w_{ij})\left[g\left(d_{ij}(X)\right)-f(\delta_{ij})\right]^2 \end{equation}\] Here, the \(w_{ij}\) are finite weights, with \(w_{ij}=0\) if the entry is missing and \(w_{ij}=1\) otherwise. There are a number of optimization techniques one can use to solve this optimization problem.
The distances used is usually some type of Minkowski distance (\(p > 0\)) as the distance fitted to the points in the configuration, \[\begin{equation} \label{eq:dist} d_{ij}(X) = ||x_{i}-x_{j}||_p=\left( \sum_{m=1}^M |x_{im}-x_{jm}|^p \right)^{1/p} \ i,j = 1, \dots, N. \end{equation}\] Typically, the norm used is the Euclidean norm, so \(p=2\). In standard MDS \(g(\cdot)=f(\cdot)=I(\cdot)\), the identity function.
This formulation enables one to express a large number of PS methods
many of which are implemented in stops
. In
stops
we allow to use specific choices for \(f(\cdot)\), \(g(\cdot)\) and \(h(\cdot)\) from the family of power
transformations so one can fit the following stress models:
For all of the above models but local MDS one can use the function
powerStressMin
which uses majorization to find the solution
(De Leeuw, 2014) . The function allows to specify a kappa
,
lambda
and nu
argument as well as a
weightmat
(the \(w_{ij}\)). LMDS can be fitted with the
lmds
function. For approximate power stress there also is a
convenience function apStressMin
.
Note that the implementation in powerStressMin
is more a
proof-of-concept than optimal. Majorizing this type of stress is a
pretty hard problem and the code we use relies on a while loop in pure
R. We plan to speed the loop up with a re-implementation in C in the
future.
The second popular type of PS supported in stops
is
based on the loss function type . Here the \(\Delta^*\) are a transformation of the
\(\Delta\), \(\Delta^*= f (\Delta)\) so that \(f(\cdot)=-(h\circ l)(\cdot)\) where \(l\) is any function and \(h(\cdot)\) is a double centering operation,
\(h(\Delta)=\Delta-\Delta_{i.}-\Delta_{.j}+\Delta_{..}\)
where \(\Delta_{i.}, \Delta_{.j},
\Delta_{..}\) are matrices consisting of the row, column and
grand marginal means respectively. These then get approximated by
(functions of) the inner product matrices of \(X\) \[\begin{equation}
\label{eq:dist2}
d_{ij}(X) = \langle x_{i},x_{j} \rangle
\end{equation}\] We can thus express classical scaling as a
special case of the general PS loss with \(d_{ij}(X)\) as an inner product, \(g(\cdot) = I(\cdot)\) and \(f(\cdot)=-(h \circ I)(\cdot)\).
If we again allow power transformations for \(g(\cdot)\) and \(f(\cdot)\) one can fit the following strain
models with stops
In stops
we have a wrapper to cmdscale
(overloading the base
function) which extend functionality
by offering an object that matches smacofP
objects with
corresponding methods.
The third type of PS supported in stops
is based on the
energy type losses of pairwise attraction and repulsion between objects
(Chen & Buja, 2014, Rusch, Mair, Hornik 2023a). It is related to
graph drawing: if vertices are equated with objects, the edge weights
are the \(\delta^*_{ij}\), the node
positions in the layout is the \(X\)
and the distance between nodes in the layout are \(d_{ij}(X)\).
One can express badness-of-fit as comprising a repulsion part \(\propto -\delta^*_{ij}d^*_{ij}(X)\) and an
attraction part \(\propto
d^{*\mu}_{ij}(X)\) (Chen & Buja, 2014). This is
\[\begin{equation}
\label{eq:energy}
\sigma_{PS}(X)=\sigma_{EM}(X) = \sum_{i<j} w^*_{ij} \left(a
d^{*\mu}_{ij}(X) - b \delta^*_{ij}d^*_{ij}(X) \right)
\end{equation}\] with \(a,b\)
being constants and \(d^*_{ij}(X)=g(d_{ij}(X))\) and \(\delta^*_{ij}=f(d_{ij}(X))\). For \(\mu=2, a=1\) and \(b=2\) this is a stress model where terms
depending solely on \(\delta^*_{ij}\)
are disregarded for finding \(X\).
Specific instances of energy models that can be fitted in
stops
are:
The objects returned from powerStressMin
,
lmds
, bcStressMin
and apStressMin
inherit from class smacofP
which extends the
smacof
classes (De Leeuw & Mair, 2009) to allow for the
transformations. Apart from that all the objects are made so that they
are compatible to methods from smacof
and also inherit from
smacof
. Accordingly, the following S3 methods are
available:
Method | Description |
---|---|
Prints the object | |
summary | A summary of the object |
plot | 2D Plots of the object |
plot3d | Dynamic 3D configuration plot |
plot3dstatic | Static 3D configuration plot |
residuals | Residuals |
coef | Model Coefficients |
Let us illustrate the usage.
First we setup our dissimilarity matrix \([\delta_{ij}]\) and call it
dis
.
<-as.matrix(smacof::kinshipdelta) dis
We now fit a series of MDS models available in
stops
.
stress
)<-powerStressMin(dis,kappa=1,lambda=1)
res1 res1
##
## Call:
## powerStressMin(delta = dis, kappa = 1, lambda = 1)
##
## Model: Power Stress SMACOF
## Number of objects: 15
## Stress-1 value: 0.264
## Number of iterations: 5352
sammon
mapping<-powerStressMin(dis,kappa=1,lambda=1,nu=-1,weightmat=dis)
res2 res2
##
## Call:
## powerStressMin(delta = dis, kappa = 1, lambda = 1, nu = -1, weightmat = dis)
##
## Model: Power Stress SMACOF
## Number of objects: 15
## Stress-1 value: 0.289
## Number of iterations: 81666
Alternatively, one can use the faster sammon
function
from MASS
(Venables & Ripley, 2002) for which we
provide a wrapper that adds class attributes and methods (and overloads
the function).
<-sammon(dis) res2a
## Initial stress : 0.17053
## stress after 3 iters: 0.10649
res2a
##
## Call: sammon(d = dis)
##
## Model: Sammon Scaling
## Number of objects: 15
## Stress: 0.1064925
elastic
scaling<-powerStressMin(dis,kappa=1,lambda=1,nu=-2,weightmat=dis)
res3 res3
##
## Call:
## powerStressMin(delta = dis, kappa = 1, lambda = 1, nu = -2, weightmat = dis)
##
## Model: Power Stress SMACOF
## Number of objects: 15
## Stress-1 value: 0.305
## Number of iterations: 1e+05
sstress
model<-powerStressMin(dis,kappa=2,lambda=2)
res4 res4
##
## Call:
## powerStressMin(delta = dis, kappa = 2, lambda = 2)
##
## Model: Power Stress SMACOF
## Number of objects: 15
## Stress-1 value: 0.346
## Number of iterations: 47130
rstress
model (with \(r=1\) as \(r=\kappa/2\))<-powerStressMin(dis,kappa=2,lambda=1)
res5 res5
##
## Call:
## powerStressMin(delta = dis, kappa = 2, lambda = 1)
##
## Model: Power Stress SMACOF
## Number of objects: 15
## Stress-1 value: 0.404
## Number of iterations: 21201
powermds
model (i.e., \(\rho=1\))<-powerStressMin(dis,kappa=2,lambda=1.5)
res6 res6
##
## Call:
## powerStressMin(delta = dis, kappa = 2, lambda = 1.5)
##
## Model: Power Stress SMACOF
## Number of objects: 15
## Stress-1 value: 0.367
## Number of iterations: 50564
powersammon
model<-powerStressMin(dis,kappa=2,lambda=1.5,nu=-1,weightmat=dis)
res7 res7
##
## Call:
## powerStressMin(delta = dis, kappa = 2, lambda = 1.5, nu = -1,
## weightmat = dis)
##
## Model: Power Stress SMACOF
## Number of objects: 15
## Stress-1 value: 0.436
## Number of iterations: 1e+05
powerelastic
scaling<-powerStressMin(dis,kappa=2,lambda=1.5,nu=-2,weightmat=dis)
res8 res8
##
## Call:
## powerStressMin(delta = dis, kappa = 2, lambda = 1.5, nu = -2,
## weightmat = dis)
##
## Model: Power Stress SMACOF
## Number of objects: 15
## Stress-1 value: 0.519
## Number of iterations: 1e+05
powerstress
model<-powerStressMin(dis,kappa=2,lambda=1.5,nu=-1.5,weightmat=2*1-diag(nrow(dis)))
res9 res9
##
## Call:
## powerStressMin(delta = dis, kappa = 2, lambda = 1.5, nu = -1.5,
## weightmat = 2 * 1 - diag(nrow(dis)))
##
## Model: Power Stress SMACOF
## Number of objects: 15
## Stress-1 value: 0.367
## Number of iterations: 57237
summary(res9)
##
## Configurations:
## D1 D2
## Aunt -0.1225 0.2498
## Brother 0.1964 -0.1400
## Cousin 0.0525 0.3099
## Daughter -0.2050 -0.1256
## Father 0.1639 -0.1822
## Granddaughter -0.2358 -0.0531
## Grandfather 0.2146 -0.1336
## Grandmother -0.2360 -0.0868
## Grandson 0.2147 -0.1079
## Mother -0.2098 -0.1363
## Nephew 0.1707 0.2110
## Niece -0.1231 0.2442
## Sister -0.2219 -0.0967
## Son 0.1702 -0.1707
## Uncle 0.1710 0.2181
##
##
## Stress per point:
## SPP SPP(%)
## Niece 0.0008 4.6861
## Nephew 0.0008 4.7097
## Aunt 0.0008 4.9367
## Uncle 0.0008 5.0269
## Daughter 0.0010 6.0875
## Son 0.0010 6.2785
## Father 0.0011 6.4569
## Mother 0.0011 6.6921
## Cousin 0.0011 6.8096
## Sister 0.0012 7.2765
## Brother 0.0012 7.2882
## Grandson 0.0013 7.9701
## Granddaughter 0.0013 8.0950
## Grandmother 0.0015 8.8103
## Grandfather 0.0015 8.8761
lmds
model<-lmds(dis,tau=0.5,k=3)
res10 res10
##
## Call:
## lmds(delta = dis, k = 3, tau = 0.5)
##
## Model: Local MDS
## Number of objects: 15
## Stress-1 value: 0.302
## Number of iterations: 80
apstress
model (approximate power stress)<-apStressMin(dis,tau=0.5,ups=2)
res11 res11
##
## Call:
## apStressMin(delta = dis, tau = 0.5, ups = 2)
##
## Model: Approximate Power Stress SMACOF
## Number of objects: 15
## Stress-1 value: 0.217
## Number of iterations: 16
We can visualize the obtained configurations with plot. For
smacofP
objects we have the following plots.
plot(res9)
plot(res9,"transplot")
plot(res9,"Shepard")
plot(res9,"resplot")
plot(res9,"bubbleplot")
powerstrain
model is rather easy to fit with simply
subjecting the dissimilarity matrix to some power. Here we use \(\lambda=3\).<-cmdscale(kinshipdelta^3)
resc resc
##
## Call: cmdscale(d = kinshipdelta^3)
##
## Model: Torgerson-Gower Scaling
## Number of objects: 15
## GOF: 0.4257747 0.6281985
summary(resc)
##
## Configurations:
## D1 D2
## Aunt 178193.10 204986.70
## Brother -174369.39 -94357.29
## Cousin -48355.46 265057.17
## Daughter 169149.17 -109936.27
## Father -145389.13 -168604.23
## Granddaughter 187039.12 -44850.80
## Grandfather -180116.08 -103668.07
## Grandmother 199145.19 -83038.71
## Grandson -169767.75 -72358.82
## Mother 185797.71 -138677.31
## Nephew -195963.98 173123.81
## Niece 168278.13 208870.31
## Sister 182224.16 -70764.01
## Son -149876.11 -135883.18
## Uncle -205988.70 170100.68
summary(resc)
plot(resc)
bcStressMin
<-bcStressMin(dis,mu=3,lambda=2,nu=2)
resbc resbc
##
## Call:
## bcStressMin(delta = dis, mu = 3, lambda = 2, nu = 2)
##
## Model: Box-Cox Stress MDS
## Number of objects: 15
## Stress-1 value: 0.251
## Number of iterations: 67
plot(resbc)
We saw that these flexible models from above ( power stress, power strain, approximate power stress, local MDS, Box-Cox MDS and their variants) all have hyperparameters \(\theta\); for example, the hyperparameter vector for power stress is \(\theta=(\kappa,\lambda,\rho)\). That begs the question which hyperparameters should be chosenas the chosen values have a huge impact on how the final configuration looks like (clearly a \(kappa=2\) will lead to a different configuration than \(\kappa=0.5\)). So far, we have simply assumed the analyst knows the “right” \(\theta\); indeed often (and as we did so far) they are simply chosen ad hoc.
The main contribution of the stops
package is not solely
in fitting the flexible MDS models for given \(\theta\). The idea behind STOPS is to also
allow to automatically select good hyperparameters \(\theta^*\) to achieve a “structured” MDS
configuration. This can be useful in a variety of contexts: to explore
or generate structures, to restrict the target space, to avoid
artifacts, to preserve certain types of structures and so forth.
For this, the flexible MDS loss functions as described above are augmented to include penalties for the type of structures one is aiming for. This combination of an MDS loss with a structuredness penalty is what we call “structure optimized loss” (stoploss) and the resulting method is coined “Structured Optimized Proximity Scaling” (or STOPS, Rusch, Mair & Hornik, 2023a). STOPS is related to “Cluster Optimized Proximity Scaling” (COPS; Rusch, Mair & Hornik, 2021) which allows to select optimal parameters so that the clusteredness appearance of the configuation is improved (see below).
Following Rusch, Mair & Hornik (2023a) the general idea is that from given observations \(\Delta\) we look for a configuration \(X\). The \(X\) has properties with regards to its structural appearance, which we call c-structuredness for configuration-structuredness. There are different types of c-structuredness people might be interested in (say, how clustered the result is, or that dimensions are orthogonal or if there is some submanifold that the data live on). We developed indices for these types of c-structuredness that capture that essence in the configuration (see Rusch, Mair & Hornik 2023b, or Rusch, Mair & Hornik 2020 for a large list of structures and indices).
We have as part of a STOPS model a proximity scaling loss function \(\sigma_{PS}(\cdot)\) and some transformation \(f_{ij}(\delta_{ij}|\theta_f)\) and \(g_{ij}(d_{ij}|\theta_g)\) and possibly \(h_{ij}(w_{ij}|\theta_h)\). These functions are parametrized and all parameters are collected in the vector \(\theta=(\theta_f,\theta_g,\theta_h)\). \(\theta\) is finite, e.g., a transformation applied to all observations like a power transformation. For example for power stress \(\theta=(\kappa,\lambda,\rho)\) and for Box Cox MDS it is \(\theta=(\mu,\lambda,\rho)\). These transformations achieve a sort of push towards more structure, so different values for \(\theta\) will in general lead to different c-structuredness.
Let us assume we are interested in \(L\) different structural qualities of \(X\) and that we have \(L\) corresponding univariate c-structuredness indices \(I_l(X\vert \gamma)\) for the \(l=1,\dots, L\) different structures, capturing the essence of the structural appearance of the configuration with respect to the \(l\)-th structure. For example, we might be interested in both the structural appearance of how clustered the configuration is (structure 1) and how strongly linearly related the column vectors of the configuration are (structure 2). We then measure the c-structuredness of \(X\) for the two structures with an index for clusteredness and one for linear dependence respectively. The \(\gamma\) are optional metaparameters for the indices, which we assume are given and fixed; they control how c-structuredness is measured. We further assume broadly that the transformations \(f(\Delta\vert\theta_f)\) and/or \(g(D(X)\vert \theta_g)\) and/or \(h(W\vert \theta_h)\) produce different c-structuredness in \(X\) for different values of \(\theta\).
In a nutshell our proposal is to select optimal hyperparameters \(\theta^\ast\) for the scaling procedure by assessing the c-structuredness of an optimal configuration \(X^\ast\) found from a PS method for given \(\theta\) usually in combination with its badness-of-fit value. We aim at finding a \(\theta^\ast\) that, when used as transformation parameters in the PS method, will give a configuration that has high (or low) values of the c-structuredness indices. We view this as a multi-objective optimization problem, where we want to maximize/minimize different criteria (either badness-of-fit, or c-structuredness, or both) over \(\theta\). C-structuredness may this way be induced at a possible expense of fit but we control the expense amount.
To formalize this we explicitly write the building blocks of the objective function used for hyperparameter tuning via STOPS as a function of \(\theta\): Let us denote by \(X^\ast(\theta)\) the optimal solution from minimizing a badness-of-fit \(\sigma_{PS}(X \vert \theta)\) for a fixed \(\theta\), so \(X^\ast(\theta):= \arg\min_{X} \sigma_{PS}(X\vert \theta)\). Further we also have the \(L\) different univariate indices with possible metaparameters \(\gamma\), \(I_l(X^\ast(\theta)\vert \gamma)\) to be optimized for.
Specific variants of STOPS can be instantiated by defining objective functions \(\text{Stoploss}(\theta\vert v_0, \dots, v_L, \gamma)\), comprising either badness-of-fit or c-structuredness indices or both in a scalarized combination. Two variants of objective functions are of the following form:
Additive STOPS (aSTOPS) \[\begin{equation} \label{eq:astops} \text{Stoploss}_{aSTOPS}(\theta\vert v_0, \dots, v_L, \gamma) = v_0 \cdot \sigma_{PS}(X^\ast(\theta)\vert\theta) + \sum^L_{l=1} v_l I_l(X^\ast(\theta)\vert\gamma) \end{equation}\]
Multiplicative STOPS (mSTOPS) \[\begin{equation} \label{eq:mstops} \text{Stoploss}_{mSTOPS}(\theta\vert v_0, \dots, v_L, \gamma) = \sigma_{PS}(X^\ast(\theta)\vert\theta)^{v_0} \cdot \prod^L_{l=1} I_l(X^\ast(\theta)\vert\gamma)^{v_l} \end{equation}\]
with \(v_0 \in \mathbb{R}_{\geq 0}\)
and \(v_1,\dots,v_L \in \mathbb{R}\)
being the scalarization weights that determine how the individual parts
(MDS loss and c-structuredness indices) are aggregated. Note that in
this formulation the aggregation is a sum/product, so the weights must
be negative if a higher index stands for more structure and we want more
of that structure. Alternatively, the stressweight
can be
negative and the strucweight
positive.
Numerically, the badness-of-fit function value \(\sigma_{PS}(X^\ast(\theta)\vert \theta)\) needs to be normalized to be scale-free and commensurable for comparability of different values of \(\theta\). The objective function for aSTOPS is fully compensatory, whereas for mSTOPS it ensures that a normalized badness-of-fit of \(0\) will always lead to the minimal \(\text{Stoploss}(\theta\vert v_0, \dots, v_L, \gamma)\) for a positive value of \(I_l(\cdot)\). For notational convenience, we will refer to the objective functions for STOPS variants by \(\text{Stoploss}(\theta)\) from now on.
The job is then to find \[\begin{equation} \arg\min_{\theta}\ \text{Stoploss}(\theta) \end{equation}\]
Minimizing stoploss can be quite difficult. In stops
we
use a nested algorithm combining optimization that internally first
solves for \(X\) given \(\theta\), \(\arg\min_X
\sigma_{PS}\left(X|\theta\right)\) in an inner loop, and then
optimize over \(\theta\) with a
metaheuristic in an outer loop. The number of iterations of the outer
loop can be controlled with itmax
, the number of iterations
of the inner loop with itmaxps
.
Implemented metaheuristics are simulated annealing
(optimmethod="SANN"
), particle swarm optimization
(optimmethod="pso"
), DIRECT
(optimmethod="DIRECT"
), stochastic global optimization
(optimmethod="stogo"
), COBYLA
(optimmethod="cobyla"
), Controlled Random Search 2 with
local mutation (optimmethod="crs2lm"
), Improved Stochastic
Ranking Evolution Strategy (optimmethod="isres"
),
Multi-Level Single-Linkage (optimmethod="mlsl"
),
Nelder-Mead (optimmethod="neldermead"
), Subplex
(optimmethod="sbplx"
), Hooke-Jeeves Pattern Search
(optimmethod="hjk"
), CMA-ES
(optimmethod="cmaes"
), Bayesian optimization with Gaussian
Process priors and Kriging (optimmethod="Kriging"
),
Bayesian optimization with treed Gaussian processes with jump to linear
models (optimmethod="tgp"
) and Adaptive Luus-Jaakola Search
(optimmethod="ALJ"
). Defaults is “ALJ”.
So there are a lot of solvers to choose from. In our experience
tgp
, ALJ
, Kriging
and
pso
usually work reasonably well for relatively low values
of itmax
, the iterations of the outer loop (up to 20). If
the data are small, then ALJ
and pso
(with s=5
particles) and with relatively high itmax
(>50) is
typically good. If the data are larger (so that solving the PS problem
is becoming very costly) we want to minimize the number of outer
iterations and then Bayesian optimization with tgp
or
Kriging
is typically good for itmax
of about
\(10\). The return of tgp
is diminishing for higher itmax
, so if a higher number of
itmax
can be afforded pso
is often better and
more efficient.
Currently the following c-structuredness types are supported:
cclusteredness
): A
clustered appearance of the configuration (\(I_k\) is the normed OPTICS cordillera
(COPS; Rusch et al. 2015a); 0 means no c-clusteredness, 1 means perfect
c-clusteredness)clinearity
): Projections
lie close to a linear subspace of the configuration (\(I_k\) is maximal multiple correlation; 0
means orthogonal, 1 means perfectly linear)cmanifoldness
):
Projections lie on a sub manifold of the configuration (\(I_k\) is maximal correlation (Sarmanov,
1958); only available for two dimensions; 1 means perfectly smooth
function)cdependence
): Random
vectors of projections onto the axes are stochastically dependent (\(I_k\) is distance correlation (Szekely et
al., 2007); only available for two dimensions; 0 means they are
stochastically independent)cassociation
): Pairwise
nonlinear association between dimensions (\(I_k\) is the pairwise maximal maximum
information coefficient (Reshef et al. 2011), 1 means perfect functional
association)cnonmonotonicity
):
Deviation from monotonicity (\(I_k\) is
the pairwise maximal maximum assymmetry score (Reshef et al. 2011), the
higher the less monotone)cfunctionality
):
Pairwise functional, smooth, noise-free relationship between dimensions
(\(I_k\) is the mean pairwise maximum
edge value (Reshef et al. 2011), 1 means perfect functional
association)ccomplexity
): Measures
the degree of complexity of the functional relationship between any two
dimensions (\(I_k\) is the pairwise
maximal minimum cell number (Reshef et al. 2011), the higher the more
complex)cfaithfulness
): How
accurate is the neighbourhood of \(\Delta\) preserved in \(D\) (\(I_k\) is the adjusted Md index of Chen
& Buja, 2013; note that this index deviates from the others by being
a function of both \(X^*\) and \(\Delta^*\) rather than \(X^*\) alone)cregularity
): How
regular are the objects arranged.chierachy
): Captures how
well a partition/ultrametric (obtained by hclust) explains the
configuration distances.cconvexity
): Measures how
convex the object arrangement is.cstriatedness
):
Measures how striated the object arrangement is.coutlying
): Measures if
there are many outlying points.cskinniness
): Measures
how skinny the object arrangement is.csparsity
): Measures how
sparse the object arrangement is.cstringiness
): Measures
how stringy the object arrangement is.cclumpiness
): Measures
how clumpy the object arrangement is.cinequlaity
): Measures
how unequal the object arrangement is (as it is the Pearson coefficient
of variation)See Rusch, Mair, Hornik (2020) or Rusch, Mair, Hornik (2023b) for a precise definition of each c-structuredness index.
One can also call each index with the function c_foo
where foo stands for the structure, so
e.g. c_stringiness(X)
will give the value of c-stringiness
for object X
. Note that the c-structuredness functions may
have additional metaparameters. For example:
<-res9$conf
Xc_stringiness(X)
## [1] 1
c_clumpiness(X)
## [1] 0.5944404
c_association(X,alpha=0.7,C=10)
## [1] 0.4370104
c_clusteredness(X,minpts=3)
## [1] 0.5546788
If we have a single \(I(X)=OC(X)\), the OPTICS cordillera (Rusch, Hornik & Mair 2018), and the transformations applied are power transformations and the weights for the \(I(X)\) is negative we essentially have P-COPS (see below).
For the MDS loss (argument loss
in functions
stops
), the functions currently support all losses derived
from powerstress, powerstrain, lmds and
bcStress. We offer dedicated functions that either use
workhorses that are more optimized for the problem at hand and/or
restrict the parameter space for the distance/proximity transformations
and thus can be faster. They are:
stress
, smacofSym
: Kruskal’s stress;
Workhorse: smacofSym
, Optimization over \(\theta=\lambda\)smacofSphere
: Kruskal’s stress for projection onto a
sphere; Workhorse smacofSphere
, Optimization over \(\theta=\lambda\)strain
, powerstrain
: Classical scaling;
Workhorse: cmdscale
, Optimization over \(\theta=\lambda\)sammon
, sammon2
: Sammon scaling;
Workhorse: sammon
or smacofSym
, Optimization
over \(\theta=\lambda\)elastic
: Elastic scaling; Workhorse:
smacofSym
, Optimization over \(\theta=\lambda\)sstress
: S-stress; Workhorse:
powerStressMin
, Optimization over \(\theta=\lambda\)rstress
: S-stress; Workhorse:
powerStressMin
, Optimization over \(\theta=\kappa\)powermds
: MDS with powers; Workhorse:
powerStressMin
, Optimization over \(\theta=(\kappa,\lambda)\)powersammon
: Sammon scaling with powers; Workhorse:
powerStressMin
, Optimization over \(\theta=(\kappa,\lambda)\)powerelastic
: Elastic scaling with powers; Workhorse:
powerStressMin
, Optimization over \(\theta=(\kappa,\lambda)\)powerstress
: Power stress model; Workhorse:
powerStressMin
, Optimization over \(\theta=(\kappa,\lambda,\rho)\)rpowerstress
: restricted power stress model; Workhorse:
powerStressMin
, Optimization over \(\theta=(\kappa,\lambda,\rho)\)apstress
: Approximate power stress model; Workhorse:
powerStressMin
, Optimization over \(\theta=(\tau,\upsilon)\)lmds
: LMDS; Workhorse: lmds
, Optimization
over \(\theta=(\tau, k)\).bcstress
: Box-Cox MDS; Workhorse:
bcStressMin
, Optimization over \(\theta=(\mu,\lambda,\rho)\).The syntax for fitting a stops
model is rather
straightforward. One has to supply the arguments dis
which
is a dissimilarity matrix and structures
a character vector
listing the c-structuredness type that should be used to augment the PS
loss (see above for the types of structures and losses). The
metaparameters for the structuredness indices should be given via the
strucpars
argument; it should be a list whose elements are
again lists corresponding to each structuredness index and listing the
parameters (if the default should be used the list element should be set
to NULL
). See the example below. The PS loss can be chosen
with the argument loss
. The type of aggregation for the
multi-objective optimization is specified in type
and can
be one of additive
or multiplicative
. As
starting value for the \(\theta\) one
can supply (theta
). If not this will be a scalar \(1\).
For the outer optimization loop the solver can be specified with
optimmethod
; per default we use "ALJ"
. The
upper
and lower
box constraints for the outer
loop have to be supplied as well (these need to be of the same dimension
as theta
). If theta
was not supplied,
upper
and lower
need to be scalar (so one
value each) and will be recycled to match the length of the
hyperparamater vector. If the maximum number of iterations of the outer
loop needs to be changed, one can use itmax
and if the
inner loop maximum number of iterations (for finding the PS
configuration) needs to be changed, one can use the itmaxps
argument. There are additional arguments for the function and one can
pass additional parameters to the fitting workhorses with
...
(see ?stops
for details).
A typical call looks like
stops(dis, structures = c("cclusteredness","clinearity"), loss="stress", upper=0, lower=1)
The returned object contains the MDS object which is usually a
smacof
or smacofP
class and all S3 methods are
at one’s disposal.
Let us fit an mSTOPS model that looks for a transformation of the \(\delta_{ij}\) so that a) the result has maximal c-clusteredness (which is 1 in the best case, so we set a negative weight for this structure) b) the projection onto the principal axes are nearly orthogonal (c-linearity close to 0, so we set a positive weight for this structure) c) but the projections onto the principal axes should be stochastically dependent (negative weight on c-dependence) and d) the fit of the MDS is also factored in (so we set positive weight on the MDS loss).
We first setup the structures:
<-c("cclusteredness","clinearity","cdependence") structures
featuring the structures we mentioned.
Next we set up the metaparameters for the c-structuredness indices
(the \(gamma\)). This must be a list of
lists. Each list element corresponds to a structure in the same ordering
as in the structures
argument (so first c-clusterednes,
then c-linearity, then c-dependence). Each of the list elements is again
a list with the named arguments that should be supplied to the
structure. If there are no metaparameters for a structure, the list
should be NULL
. Let’s illustrate with an example. For the
OPTICS cordillera (c-lcusteredness) we may want metaparameters
dmax=1
, epsilon=10
and minpts=2
,
so we need to put them in a list with named elements as
list(epsilon=10,minpts=2,dmax=1.3)
. For c-linearity we have
no parameters, so we use NULL
(or list(NULL)
).
For the c-dependence we have a single parameter, index
which we set to 1.2 in another named list as
list(index=1.2)
. We then create a list with these list as
elements:
<-list(
strucparslist(epsilon=10,minpts=2,dmax=1.3), # element 1: list of arguments for c-clusteredness
NULL, # element 2: argument for c-linearity (empty as it has no parameters)
list(index=1.2) # element 3: list of arguments for c-dependence
)
If some arguments are not listed, the default values are taken. If there’s only one structre, we don’t need a list of list but the list of metaparameters suffices.
Since we use mSTOPS and a negative weight for c-linearity and c-dependence, a c-linearity/c-dependence close to 0 will overall dominate the stoploss function with the other two criteria being more of an afterthought - in aSTOPS that would be different. We weight all of them equally (\(0.33\)).
[!!]: This is generally the approach to be chosen: We minimize the stoploss, so a c-structuredness index that should be (numerically) large needs a negative weight and a c-structuredness index that should be (numerically) small needs a positive weight.
We now run stops
.
set.seed(666)
<-stops(kinshipdelta,loss="stress",stressweight=1,structures=c("cclusteredness","clinearity","cdependence"),strucweight=c(-0.33,0.33,-0.33),verbose=0,strucpars=strucpars,type="multiplicative",lower=0,upper=8)
ressm ressm
##
## Call: stops(dis = kinshipdelta, loss = "stress", structures = c("cclusteredness",
## "clinearity", "cdependence"), stressweight = 1, strucweight = c(-0.33,
## 0.33, -0.33), strucpars = strucpars, lower = 0, upper = 8,
## verbose = 0, type = "multiplicative")
##
## Model: multiplicative STOPS with stress loss function and theta parameter vector (lambda) = 1.432493
##
## Number of objects: 15
## MDS loss value: 0.06970603
## C-Structuredness Indices: cclusteredness 0.24108469 clinearity 0.02696876 cdependence 0.20524114
## Structure optimized loss (stoploss): 0.05705468
## MDS loss weight: 1 c-structuredness weights: -0.33 0.33 -0.33
## Number of iterations of ALJ optimization: 28
The print function gives us information about the model. The selected
hyperparameter was \(\lambda^*=\) NA.
With that and the given metaparameters of the indices we have values of
0.2410847 for c-clusteredness, 0.0269688 for c-linearity and 0.2052411
for c-dependence. Stress ($stress.m
) was 0.069706 and
stress-1 ($stress
, the square root of raw stress) is
sqrt(0.069)=0.26.
We plot the obtained configuration with the “optimal” \(\lambda^*\).
plot(ressm)
We can see that the configuration mimics the structuredness we wanted (there is clusteredness, i.e., the terms are arranged in clusters, the projection is near orthogonal but there is a stochastic dependence that is nonlinear. This is a star-shaped arrangement. (No, this is Patrick).
Let us compare this with the corresponding aSTOPS
<-stops(kinshipdelta,loss="stress",stressweight=1,structures=c("cclusteredness","clinearity","cdependence"),strucweight=c(-0.33,0.33,-0.33),verbose=0,strucpars=strucpars,type="additive",lower=0,upper=8)
ressa ressa
##
## Call: stops(dis = kinshipdelta, loss = "stress", structures = c("cclusteredness",
## "clinearity", "cdependence"), stressweight = 1, strucweight = c(-0.33,
## 0.33, -0.33), strucpars = strucpars, lower = 0, upper = 8,
## verbose = 0, type = "additive")
##
## Model: additive STOPS with stress loss function and theta parameter vector (lambda) = 1.483125
##
## Number of objects: 15
## MDS loss value: 0.07037108
## C-Structuredness Indices: cclusteredness 0.24482385 clinearity 0.02667558 cdependence 0.20393552
## Structure optimized loss (stoploss): -0.06891657
## MDS loss weight: 1 c-structuredness weights: -0.33 0.33 -0.33
## Number of iterations of ALJ optimization: 25
plot(ressa)
Not really different here.
When choosing a c-structuredness index, one needs to be clear what
structure one might be interested in and how it interacts with the PS
loss chosen. Consider the following example: We fit a
powermds
model to the kinship data and want to maximize
c-association (i.e., any non-linear relationship) and c-manifoldness but
minimize the c-linearity. In other words we try to find a power
transformation of \(\Delta\) and \(D\) so that the objects are positioned in
the configuration in such a way that the projection onto the principal
axes are as close as possible to being related by a smooth but
non-linear function.
We use tgp
as optimization method for
itmax=10
steps and restrict the search space to be between
\(0\) and \(5\). The maximum number of iterations to
get the PS configuration is set to itmaxps=1000
.
set.seed(666)
<-stops(kinshipdelta,structures=c("cassociation","cmanifoldness","clinearity"),loss="powermds",theta=c(1,1),strucpars=list(NULL,NULL,NULL),type="additive",strucweight=c(-0.5,-0.5,0.5),lower=c(0,0),upper=c(5,5),optimmethod="tgp",itmaxps=1000,itmax=10) resa
resa
##
## Call: stops(dis = kinshipdelta, loss = "powermds", theta = c(1, 1),
## structures = c("cassociation", "cmanifoldness", "clinearity"),
## strucweight = c(-0.5, -0.5, 0.5), strucpars = list(NULL,
## NULL, NULL), optimmethod = "tgp", lower = c(0, 0), upper = c(5,
## 5), type = "additive", itmax = 10, itmaxps = 1000)
##
## Model: additive STOPS with powermds loss function and theta parameter vector (kappa lambda) = 0.4810473 4.443448
##
## Number of objects: 15
## MDS loss value: 0.2470057
## C-Structuredness Indices: cassociation 0.515506235 cmanifoldness 0.956142079 clinearity 0.001926927
## Structure optimized loss (stoploss): -0.487855
## MDS loss weight: 1 c-structuredness weights: -0.5 -0.5 0.5
## Number of iterations of tgp optimization: 10
We see in this model (resa
) that indeed the
c-manifoldness is almost at 1, which suggests we have the objects
arranged close to a manifold structure. c-association is also there and
clinearity is near 0, so we see that this is a linear structure. How
does this relationship look like?
plot(resa)
The manifold resembles an oval and the objects are arranged in clusters close to that oval. Due to the oval shape the c-association isn’t high as that isn’t really what c-association measures (it measures non-linear curves that do not turn whereas manifoldness ignores bending).
What we can also see is that there are three clear clusters of 2 or more objects, but we didn’t select for that.
Note that it may just as well be possible to have a high
c-association and no c-clusteredness at all (e.g., points lying
equidistant on a smooth non-linear curve), as we did not optimize for
the latter. The transformation in powermds
that is optimal
with respect to c-clusteredness could be different.
Indeed one can even optimize for c-clusteredness alone (setting
stressweight=0
) and using it as a
“goodness-of-clusteredness” index (i.e., we let the \(d_{max}\) vary between the configurations
and take the highest attainable normed cordillera). This time we use
optimmethod="tgp"
and itmax=20
.
set.seed(666)
<-stops(kinshipdelta,structures=c("cclusteredness"),loss="powermds",strucpars=list(list(epsilon=10,dmax=NULL,minpts=2)),type="additive",strucweight=-1,stressweight=0,lower=0.1,upper=10,optimmethod="tgp",itmax=20)
resa2 resa2
##
## Call: stops(dis = kinshipdelta, loss = "powermds", structures = c("cclusteredness"),
## stressweight = 0, strucweight = -1, strucpars = list(list(epsilon = 10,
## dmax = NULL, minpts = 2)), optimmethod = "tgp", lower = 0.1,
## upper = 10, type = "additive", itmax = 20)
##
## Model: additive STOPS with powermds loss function and theta parameter vector (kappa lambda) = 1.593239 1.593239
##
## Number of objects: 15
## MDS loss value: 0.1081298
## C-Structuredness Indices: cclusteredness 0.7194585
## Structure optimized loss (stoploss): -0.7194585
## MDS loss weight: 0 c-structuredness weights: -1
## Number of iterations of tgp optimization: 20
Then we get a projection with c-clusteredness of 0.7194585. This can of course be done for an c-structuredness.
It is also possible to use the stops
function for
finding the hyperparameters that are optimal for minimizing the
badness-of-fit for a given transformation class in the the non-augmented
models specified in loss
, by setting the
strucweight
(the weight of any c-structuredness) to 0. Then
the function optimizes the MDS loss function only. We use
optimmethod="pso"
with 5 particles and 100 steps.
<-stops(kinshipdelta,theta=1,structures=c("clinearity"),strucweight=0,loss="stress",verbose=0,upper=8,lower=0,optimmethod="pso",itmax=100) ressa
Here the minimum badness-of-fit by subjecting the \(\delta\) to a power transformation \(\delta^*=\delta^\lambda\) is obtained with \(\lambda^*=\) and a stress-1 of 0.2652755. This strategy is similar to how the optimal scaling of dhats works within other MDS methods (e.g., in nonmetric and spline MDS).
A special STOPS model is P-COPS (Rusch, Mair & Hornik 2021). Along the lines of STOPS the overall objective function, which we call , is a weighted combination of the \(\theta-\)parametrized loss function, \(\sigma_{PS}\left(X(\theta)|\theta\right)\), and a c-clusteredness measure, the OPTICS cordillera or \(OC(X(\theta);\epsilon,k,q)\) to be optimized as a function of \(\theta\) or \[\begin{equation} \label{eq:spstress} \text{coploss}(\theta) = v_1 \cdot \sigma_{PS}\left(X(\theta) | \theta \right) - v_2 \cdot \text{OC}\left(X(\theta);\epsilon,k,q\right) \end{equation}\] with \(v_1,v_2 \in \mathbb{R}\) controlling how much weight should be given to the scaling fit measure and the c-clusteredness.
The c-clusteredness index we use is the OPTICS cordillera which measures how clustered a configuration appears. It is based on the OPTICS algorithm that outputs an ordering together with a distance. The OPTICS cordillera is now simply an agregation of that information. Since we know what constitutes a maximally clustered result, we can derive an upper bound and normalize the index to lie between 0 and 1. If it is maximally clustered, the index gets a value of 1,and it gets 0 if all points are equidistant to their nearest neighbours (a matchstick embedding). See the vignette in the COPS package for more.
We recreate parts of an example of Rusch, Mair and Hornik (2023)
here. The data are 500 handwritten pen digits and we subject them to
different flexible MDS methods and select hyperparameters with STOPS
based on c-clusteredness and c-manifoldness. We use Sammon mapping with
power transformations for the proximities, Box-Cox MDS, lMDS (all three
like in the article) and also approximate power stress (different from
the article). The data are available in the package via
data(Pendigits500)
.
This is a tough data set for our implementations because it is quite large, so an individual MDS takes a while. When running STOPS we have to fit a large number of MDS, so this can take quite a long time and therefore we also time so that users can see what to expect.
# data setup
data(Pendigits500)
<- Pendigits500
pendss names(pendss)[17] <- "digit"
## setup data and the c-structuredness hyperparameters
<- dist(pendss[,1:16]) #only features are used to get the dissimilarity matrix, not the digit label
dis
# parameters for OC (c-clusteredness)
<- 2
q <- c(0,0.6)
rang <- 5
minpts <- 10
eps <- 3 scale
We start with an initial solution (a Sammon mapping) and see how the c-structuredness of that result is
## initial sammon mapping solution
<- sammon(dis) initsam
## Initial stress : 0.15233
## stress after 1 iters: 0.15079
## c-structuredness of initial solution
<- cordillera::cordillera(initsam$points,minpts=minpts,q=q,epsilon=eps,rang=rang,scale=scale)
c1 <- c_manifoldness(initsam$points)
ccom1 c1
## raw normed
## 0.70519137 0.08331615
ccom1
## [1] 0.593913
So the c-clusteredness was 0.0833162 and c-manifoldness was 0.593913. This is our reference solution to see if we can improve c-structuredness.
Next we set up the structures and the parameters for the structuredness indices
# #c-structuredness metaparameters
<- list(list(minpts=minpts,epsilon=eps,rang=rang,scale=scale), #cordillera
strucpars list(alpha=0.9,C=15,var.thr=1e-5,zeta=NULL) #manifoldness
) <- c("cclusteredness","cmanifoldness") #structures structures
We use weights of \(-400\) for c-clusteredness and \(-2.5\) for c-manifoldness.
## c-structuredness weights
<- c(-400,-2.5) strucweight
Now we have everything to run STOPS. We use Sammon mapping with
powers for the dissimilarities, local MDS and Box-Cox MDS. The search
space is from \(1\) to \(6\) and we use
optimmethod='tgp'
(one can set verbose=4
to
see details of the iteration progress). We also time the
calculations.
#Sammon with power transformations of proximities
set.seed(666)
<- system.time(restgp <- stops(dis,theta=5.6,loss="sammon",structures=structures,stressweight=1,strucweight=strucweight,strucpars=strucpars,lower=1,upper=6,optimmethod="tgp",model="btgpllm",itmax=10))
timesam
#Box Cox MDS
set.seed(667)
<- system.time(restgpbc <- stops(dis,theta=c(4.63,2.77,5.12),loss="bcstress",structures=structures,stressweight=1,strucweight=strucweight,strucpars=strucpars,lower=c(0.1,0.1,0.1),upper=c(6,6,6),optimmethod="tgp",model="btgpllm",itmax=10,itmaxps=1e4))
timebc
#lMDS
set.seed(669)
<- system.time(restgplmds <- stops(dis,theta=c(14.5,3),loss="lmds",weightmat=dis,structures=structures,stressweight=1,strucweight=strucweight,strucpars=strucpars,lower=c(5,0),upper=c(50,3),optimmethod="tgp",model="btgpllm",itmax=10,itmaxps=1e4))
timelmds
#approx p stress
set.seed(670)
<- system.time(restgpaps <- stops(dis,theta=c(0.53,2.85),loss="apstress",structures=structures,stressweight=1,strucweight=strucweight,strucpars=strucpars,lower=c(0.1,0.1),upper=c(6,6),optimmethod="ALJ",itmax=20,itmaxps=1e4)) timeaps
We now plot the results. We color the digits differently and adjust all the configurations to the same plotting region via Procrustes transformation (as MDS results are invariant to translation, scaling and rotation).
#plot the digit results
<- cols1 <- colstraj <- factor(pendss[,17])
colsopt levels(colsopt) <- hcl.colors(10,"Dark 3")
#Procrustes adjustment to lmds configuration for STOPS optimized
<- restgp$fit$conf
samconf <- restgplmds$fit$conf
lmdsconf <- restgpbc$fit$conf
bcconf <-restgpaps$fit$conf
apsconf<- Procrustes(lmdsconf,samconf)$Yhat
samconf <- Procrustes(lmdsconf,bcconf)$Yhat
bcconf <- Procrustes(lmdsconf,apsconf)$Yhat apsconf
And here are the plots.
par(mfrow=c(2,2))
#Sammon
plot(samconf[,1],samconf[,2],col=as.character(colsopt),type="n",xlab="",ylab="",asp=1,axes=TRUE,main="Sammon")
points(samconf[,1],samconf[,2],col=as.character(colsopt),pch=as.character(pendss[,17]))
box()
#lmds
plot(lmdsconf[,1],lmdsconf[,2],col=as.character(colsopt),type="n",xlab="",ylab="",asp=1,main="lMDS",axes=TRUE)
points(lmdsconf[,1],lmdsconf[,2],col=as.character(colsopt),pch=as.character(pendss[,17]))
box()
#bc stress
plot(bcconf[,1],bcconf[,2],col=as.character(colsopt),type="n",xlab="",ylab="",asp=1,axes=TRUE,main="BC-MDS")
points(bcconf[,1],bcconf[,2],col=as.character(colsopt),pch=as.character(pendss[,17]))
box()
#ap stress
plot(apsconf[,1],apsconf[,2],col=as.character(colsopt),type="n",xlab="",ylab="",asp=1,axes=TRUE,main="Approx. POST-MDS")
points(apsconf[,1],apsconf[,2],col=as.character(colsopt),pch=as.character(pendss[,17]))
box()
We can see that Sammon mapping with power transformation shows a clusteredness that is both high and sensible. The values is 0.1500707 for Sammon. Box-Cox MDS also give s arelatively clustered result, but it seems it finds 3 clusters and that is less in line wht the ground truth. The value is 0.1240366.
restgp
##
## Call: stops(dis = dis, loss = "sammon", theta = 5.6, structures = structures,
## stressweight = 1, strucweight = strucweight, strucpars = strucpars,
## optimmethod = "tgp", lower = 1, upper = 6, verbose = 4, itmax = 10,
## model = "btgpllm")
##
## Model: additive STOPS with sammon loss function and theta parameter vector (lambda) = 5.399645
##
## Number of objects: 500
## MDS loss value: 0.6970387
## C-Structuredness Indices: cclusteredness 0.1500707 cmanifoldness 0.9280689
## Structure optimized loss (stoploss): -61.6514
## MDS loss weight: 1 c-structuredness weights: -400 -2.5
## Number of iterations of tgp optimization: 10
restgpbc
##
## Call: stops(dis = dis, loss = "bcstress", theta = c(4.63, 2.77, 5.12),
## structures = structures, stressweight = 1, strucweight = strucweight,
## strucpars = strucpars, optimmethod = "tgp", lower = c(0.1,
## 0.1, 0.1), upper = c(6, 6, 6), verbose = 4, itmax = 10,
## itmaxps = 10000, model = "btgpllm")
##
## Model: additive STOPS with bcstress loss function and theta parameter vector (mu lambda rho) = 4.631187 4.85914 1.285368
##
## Number of objects: 500
## MDS loss value: 0.1293095
## C-Structuredness Indices: cclusteredness 0.1240366 cmanifoldness 0.9610854
## Structure optimized loss (stoploss): -51.88803
## MDS loss weight: 1 c-structuredness weights: -400 -2.5
## Number of iterations of tgp optimization: 10
Local MDS and approximate power stress give configurations that are less clustered, although the arrangement of points is still sensible even if they don’t align in clusters.
restgpbc
##
## Call: stops(dis = dis, loss = "bcstress", theta = c(4.63, 2.77, 5.12),
## structures = structures, stressweight = 1, strucweight = strucweight,
## strucpars = strucpars, optimmethod = "tgp", lower = c(0.1,
## 0.1, 0.1), upper = c(6, 6, 6), verbose = 4, itmax = 10,
## itmaxps = 10000, model = "btgpllm")
##
## Model: additive STOPS with bcstress loss function and theta parameter vector (mu lambda rho) = 4.631187 4.85914 1.285368
##
## Number of objects: 500
## MDS loss value: 0.1293095
## C-Structuredness Indices: cclusteredness 0.1240366 cmanifoldness 0.9610854
## Structure optimized loss (stoploss): -51.88803
## MDS loss weight: 1 c-structuredness weights: -400 -2.5
## Number of iterations of tgp optimization: 10
restgpaps
##
## Call: stops(dis = dis, loss = "apstress", theta = c(0.53, 2.85), structures = structures,
## stressweight = 1, strucweight = strucweight, strucpars = strucpars,
## optimmethod = "ALJ", lower = c(0.1, 0.1), upper = c(6, 6),
## verbose = 4, itmax = 20, itmaxps = 10000)
##
## Model: additive STOPS with apstress loss function and theta parameter vector (tau upsilon) = 0.5091412 2.842567
##
## Number of objects: 500
## MDS loss value: 0.05392559
## C-Structuredness Indices: cclusteredness 0.08691348 cmanifoldness 0.81720840
## Structure optimized loss (stoploss): -36.75449
## MDS loss weight: 1 c-structuredness weights: -400 -2.5
## Number of iterations of ALJ optimization: 20
This brings us to c-manifoldness which is very high for Sammon, Box-Cox MDS and lMDS. Approximate power stress MDS (POST-MDS) shows c-manifoldness that is a bit less bit still relatively high (it is less mainly due ot being more noisy around the imagined manifold). The manifolds are alos easy to see: for Sammon it seems to be reminiscent of a spiral, for approximate power stress it is a circle, for Box-Cox MDS it is a circle as well (or three spherical clusters) and for lMDS a manifold is also apparent although I don’t think it has a name.
The interesting thing here is that the objects are arrnaged along the manifolds in a way that suggests a structure of how handwritten digits are related, e.g., that 0s and 8s are related and that they are closer to 5s than to 1s. There is also a way to get from one digit to others along the manifold, e.g, in approx. POST-MDS one would get from the 6s to over the 9s to the 3s. This may give a hypothesis about a relationship amongst them.
We also briefly describe some other functions in the package
stops
.
Since the inner optimization problem in STOPS models is hard and
takes long, Rusch, Mair and Hornik (2021) developed a metaheuristic for
the outer optimization problem that typically needs less calls to the
inner minimization than SANN
, albeit without the guarantees
of convergence to a global minimum for non-smooth functions. It is an
adaptation of the Luus-Jaakola random search (Luus & Jaakola 1973).
It can used with the function ljoptim
which modeled its
output after optim
. It needs as arguments x
a
starting value, fun
a function to optimize, a
lower
and upper
box constraint for the search
region. By using the argument adaptive=TRUE
or
FALSE
one can switch between our adaptive version and the
original LJ algorithm. Accuracy of the optimization can be controlled
with the maxit
(maximum number of iterations),
accd
(terminates after the length of the search space is
below this number ) and acc
arguments (terminates if
difference of two subsequent function values are below this value).
We optimize a “Wild Function” with the non-adaptive LJ version (and
numerical accuracies of at least 1e-16
for
accd
and acc
).
set.seed(210485)
<- function (x) 10*sin(0.3*x)*sin(1.3*x^2) + 0.00001*x^4 + 0.2*x+80
fwild <-ljoptim(50, fwild,lower=-50,upper=50,adaptive=FALSE,accd=1e-16,acc=1e-16)
res2 res2
## $par
## [1] -15.81515
##
## $value
## [1] 67.46773
##
## $counts
## function gradient
## 463 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
plot(fwild, -50, 50, n = 1000, main = "ljoptim() minimising 'wild function'")
points(res2$par,res2$value,col="red",pch=19)
We also provide a procrustes adjustment to make two configurations
visually comparable. The function is conf_adjust
and takes
two configurations conf1
the reference configuration and
conf2
another configuration. It returns the adjusted
versions
conf_adjust(conf1,conf2)
Borg I, Groenen PJ (2005). Modern multidimensional scaling: Theory and applications. 2nd edition. Springer, New York
Buja A, Swayne DF, Littman ML, Dean N, Hofmann H, Chen L (2008). Data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics, 17 (2), 444-472.
Chen L, Buja A (2013). Stress functions for nonlinear dimension reduction, proximity analysis, and graph drawing. Journal of Machine Learning Research, 14, 1145-1173.
Chen L, Buja A (2009). “Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis.” Journal of the American Statistical Association, 104(485), 209–219.
De Leeuw J (2014). Minimizing r-stress using nested majorization. Technical Report, UCLA, Statistics Preprint Series.
De Leeuw J, Mair P (2009). Multidimensional Scaling Using Majorization: SMACOF in R. Journal of Statistical Software, 31 (3), 1-30.
Kruskal JB (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29 (1), 1-27.
Luus R, Jaakola T (1973). Optimization by direct search and systematic reduction of the size of search region. American Institute of Chemical Engineers Journal (AIChE), 19 (4), 760-766.
McGee VE (1966). The multidimensional analysis of ‘elastic’ distances. British Journal of Mathematical and Statistical Psychology, 19 (2), 181-196.
Reshef D, Reshef Y, Finucane H, Grossman S, McVean G, Turnbaugh P, Lander E, Mitzenmacher M, Sabeti P (2011). Detecting novel associations in large datasets. Science, 334, 6062.
Rosenberg S, Kim MP (1975). The method of sorting as a data gathering procedure in multivariate research. Multivariate Behavioral Research, 10, 489-502.
Rusch T, Hornik K, Mair P (2018). Assessing and Quantifying Clusteredness: The OPTICS Cordillera, Journal of Computational and Graphical Statistics, 27:1, 220-233.
Rusch T, Mair P, Hornik K (2020). STOPS: Structure Optimized Proximity Scaling. Discussion Paper Series / Competence Center for Empirical Research Methods, No 2020/1, WU Vienna, Austria.
Rusch T, Mair P, Hornik K (2021) Cluster Optimized Proximity Scaling. Journal of Computational and Graphical Statistics, 30:4, 1156-1167.
Rusch T, Mair P, Hornik K (2023a). Structure-based hyperparameter selection with Bayesian optimization in multidimensional scaling. Statistics & Computing, 33:28.
Rusch T, Mair P, Hornik K (2023b). Supplement to “Structure-based hyperparameter selection with Bayesian optimization in multidimensional scaling”. Statistics & Computing, 33:28.
Sammon JW (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 18 (5), 401-409
Sarmanov OV (1958) The maximum correlation coefficient (symmetric case). Dokl. Akad. Nauk SSSR, 120 : 4 (1958), 715 - 718.
Székely GJ, Rizzo ML, Bakirov NK (2007). Measuring and testing independence by correlation of distances, The Annals of Statistics, 35:6, 2769–2794.
Takane Y, Young F, de Leeuw J (1977). Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features. Psychometrika, 42 (1), 7-67.
Torgerson WS (1958). Theory and methods of scaling. Wiley.
Venables WN, Ripley BD (2002). Modern Applied Statistics with S. Fourth edition. Springer, New York.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.