The traj package implements a clustering algorithm for functional data (henceforth referred to as trajectories) that extends previous work by Leffondré et al. [1]. The algorithm consists of three steps. The first step summarizes the main features of the trajectories by computing the 18 measures listed below and detailed in Appendix A. The second step performs a dimensionality reduction on the 18 measures to extract the main features of the trajectories; specifically, a principal component analysis is used to select the measures that best characterize the trajectories. In the third step, a clustering algorithm is used to form clusters of trajectories based on the measures selected in step 2.
Let us illustrate how to use the traj package on an artificially created dataset (trajdata) comprised of 130 trajectories following four distinct patterns (A, B, C, D).
library(traj)
data(trajdata)
head(trajdata)
dat <- trajdata[, -c(1,2)]
Each trajectory is made up of six observations and there are no missing values. The function Step1Measures computes the measures. By default, measure 18 (Later change/Early change) is not included in the analysis because, depending on the situation (uneven observation times, attrition, missing values, etc.), there might not be a natural midpoint for each trajectory. In the present data set, we include measure 18. By leaving the ‘midpoint’ argument at its default of NULL, the third observation will be taken as the midpoint.
step1 <- Step1Measures(Data = dat, measures = 1:18)
summary(step1)
## Description of the measures:
## m1: Maximum
## m2: Range
## m3: Mean value
## m4: Standard deviation
## m5: Slope of the linear model
## m6: Proportion of variance explained by the linear model (R squared)
## m7: Curve length (total variation)
## m8: Number of times crossing the mean per unit time
## m9: Proportion of time spent under the mean
## m10: Minimum of the first derivative
## m11: Maximum of the first derivative
## m12: Mean of the first derivative
## m13: Standard deviation of the first derivative
## m14: Minimum of the second derivative
## m15: Maximum of the second derivative
## m16: Mean of the second derivative
## m17: Standard deviation of the second derivative
## m18: Later change/Early change
##
## Summary of measures:
## m1 m2 m3 m4 m5 m6
## Min. 11.16944 4.974832 6.734208 1.655156 -24.7364594 6.548737e-05
## 1st Qu. 17.44338 13.596531 10.683520 4.608162 -18.0088233 2.119554e-01
## Median 86.89627 80.567344 49.521374 24.392691 -14.3756056 8.825381e-01
## Mean 63.71865 61.281725 37.032177 18.761021 -8.8673769 6.408343e-01
## 3rd Qu. 95.64244 94.429695 55.289514 29.552200 0.1682977 9.319712e-01
## Max. 113.38473 128.477551 61.923002 37.161789 16.2687616 9.903453e-01
## m7 m8 m9 m10 m11 m12
## Min. 11.01804 0.2000000 0.2000000 -63.352715 -14.935763 -25.69551019
## 1st Qu. 33.25262 0.2000000 0.5000000 -33.047714 -5.681113 -18.46646486
## Median 81.92250 0.2000000 0.5000000 -21.252427 2.160407 -14.91259635
## Mean 70.27963 0.3507692 0.5261538 -20.647836 2.202651 -8.98945884
## 3rd Qu. 102.00426 0.6000000 0.6000000 -5.955160 6.702146 0.04639682
## Max. 137.85654 1.0000000 0.9000000 9.971951 44.466830 17.57402397
## m13 m14 m15 m16 m17 m18
## Min. 1.243051 -27.917171 -7.326749 -12.5077887 0.6418593 -177.05043991
## 1st Qu. 4.101053 -12.992123 1.919339 -4.0529737 3.1900188 -0.01697665
## Median 6.502801 -9.027665 4.740119 -1.5795261 4.8789085 1.12915597
## Mean 7.207440 -9.280153 6.277813 -1.6084247 5.2729286 1.11919749
## 3rd Qu. 9.685304 -4.659910 10.095925 0.9862174 6.8043731 2.81539207
## Max. 18.868623 2.347271 27.813294 11.9407699 15.3664412 95.15142933
Once the measures are computed, we use the Step2Selection function to extract the measures that best characterize the trajectories.
step2 <- Step2Selection(trajMeasures = step1)
summary(step2)
## The measures m4, m12 were discarded because they were perfectly or almost perfectly correlated with another measure. Upon forming the principal components from the remaining measures, 4 of them had a variance greater than any one of the normalized measures. Together, they explained 82.7% of the total variance. A varimax rotation was performed to maximize the correlation with the original measures without affecting the proportion of explained variance. For each rotated factor, the measure that had the highest correlation (loading) with it was selected. As a result of this procedure, the selected measures are, in decreasing order of variance explained, m6, m14, m11, m15. Use print() to see more detailed informations.
Two measures are defined as “perfectly or almost perfectly correlated” if the absolute value of their Pearson correlation coefficient is greater than 0.98. The print function provides more detailed information:
print(step2)
## m4 has been discarded due to being perfectly or almost perfectly correlated with m2.
## m12 has been discarded due to being perfectly or almost perfectly correlated with m5.
##
## In decreasing order of variance explained, the selected measures are m6, m14, m11, m15.
##
## Loadings:
## RC1 RC4 RC3 RC2
## m1 0.847 0.381 -0.290
## m2 0.788 0.483 -0.338
## m3 0.878 0.362 -0.190
## m5 -0.382 -0.361 0.811
## m6 0.901 0.206 -0.246
## m7 0.697 0.597 -0.324
## m8 -0.860 0.144
## m9 0.416 0.233 0.170
## m10 -0.355 -0.575 0.683
## m11 0.952
## m13 0.445 0.801 -0.104
## m14 -0.257 -0.837 0.364
## m15 0.231 0.947
## m16 -0.168 -0.415 0.868
## m17 0.239 0.826 0.442
## m18 0.184 -0.135
##
## RC1 RC4 RC3 RC2
## SS loadings 4.963 3.667 2.544 2.052
## Proportion Var 0.310 0.229 0.159 0.128
## Cumulative Var 0.310 0.539 0.698 0.827
Measure 4 (Standard deviation) was dropped because it is perfectly or almost perfectly correlated with measure 2 (Range), and measure 12 (Mean of the first derivative) was dropped because it is perfectly or almost perfectly correlated with measure 5 (Slope of the linear model). The Step3Clusters function uses the k-medoids algorithm (function cluster::pam) on the measures selected in step 2 to cluster the trajectories.
library(cluster)
set.seed(1337)
step3 <- Step3Clusters(trajSelection = step2)
If the nclusters argument is set to NULL (the default), the optimal number \(k\) of clusters will be determined using the Calinski-Harabasz criterion. In our case, this gives \(k=4\) clusters, as can be inspected visually with
par(mfrow = c(1, 1))
plot(step3, which.plots = 1, ask = FALSE)
To visually inspect the classification, we write plot(step3, ask = TRUE). We can also ask for specific plots with the which.plots argument.
The “Sample trajectories” plot tends to get cluttered when there are too many clusters, so it is always a good idea to plot each cluster in full:
color.pal <- palette.colors(palette = "Okabe-Ito", alpha = 1)
par(mfrow = c(1, 1))
for (k in 1:4) {
  w <- which(step3$partition$Cluster == k)
  dat.w <- dat[w, ]
  plot(y = 0, x = 0, ylim = c(floor(min(dat)), ceiling(max(dat))), xlim = c(1, 6),
       xlab = "", ylab = "", type = "n",
       main = paste0("Cluster ", k, " (n = ", step3$partition.summary[k], ")"))
  for (i in 1:length(w)) {
    lines(y = dat.w[i, ], x = 1:6, col = color.pal[k])
  }
}
In this section, we expand on how the eighteen measures are computed. Let \(y=y(t)\) denote a continuous function \([a,b]\rightarrow \mathbb{R}\) and let \(y(t_i)\) denote the trajectory obtained by measuring \(y(t)\) at times \(a\leq t_1<\ldots< t_N\leq b\), where \(N\geq 3\). We do not assume that the times \(t_i\) are equidistant from one another.
m1: Maximum. This is \[\max_iy(t_i)\]
m2: Range. This is \[\max_iy(t_i) - \min_iy(t_i)\]
m3: Mean value. This measure is defined by the formula
\[ \mathrm{m3}=\frac{1}{t_N-t_1}\sum_{i=1}^{N-1}\frac{y(t_i)+y(t_{i+1})}{2}(t_{i+1}-t_i). \]
m4: Standard deviation. This measure is given by the formula
\[ \mathrm{m4} = \sqrt{\frac{1}{t_N-t_1}\sum_{i=1}^{N-1}\frac{\left(y(t_i)-\mathrm{m3}\right)^2 + \left(y(t_{i+1})-\mathrm{m3}\right)^2}{2}(t_{i+1}-t_i)}. \]
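To make the two trapezoidal formulas concrete, here is a minimal numerical sketch (written in Python for brevity; it is an illustration independent of the traj package, and the function names are ours):

```python
import math

def m3_mean(t, y):
    """Time-weighted (trapezoidal) mean of a trajectory: measure m3."""
    total = sum((y[i] + y[i + 1]) / 2 * (t[i + 1] - t[i]) for i in range(len(t) - 1))
    return total / (t[-1] - t[0])

def m4_sd(t, y):
    """Time-weighted standard deviation of a trajectory: measure m4."""
    mu = m3_mean(t, y)
    total = sum(((y[i] - mu) ** 2 + (y[i + 1] - mu) ** 2) / 2 * (t[i + 1] - t[i])
                for i in range(len(t) - 1))
    return math.sqrt(total / (t[-1] - t[0]))

# Unevenly spaced times are handled by the (t_{i+1} - t_i) weights.
t = [0.0, 1.0, 3.0, 4.0]
print(m3_mean(t, [0.0, 1.0, 3.0, 4.0]))  # 2.0 (time-average of y = t on [0, 4])
print(m4_sd(t, [5.0, 5.0, 5.0, 5.0]))    # 0.0 (constant trajectory)
```

Because the weights are the interval widths, m3 and m4 reduce to the usual sample mean and (population-style) standard deviation only when the observation times are equidistant.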
m5: Slope of the linear model. Here the \(y(t_i)\) are regressed against the \(t_i\) in the linear model \(y(t_i) = \beta_0 + \beta_1t_i+\epsilon_i\) using the method of least squares and m5 is defined as \(\hat{\beta}_1\).
m6: Proportion of variance explained by the linear model (R squared). This is the coefficient of determination of the linear model used to define m5.
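The pair (m5, m6) is an ordinary least-squares fit; the following sketch (in Python, independent of the package) computes both from the closed-form formulas:

```python
def m5_m6(t, y):
    """Least-squares fit of y(t_i) = b0 + b1*t_i: returns (slope m5, R^2 m6)."""
    n = len(t)
    tbar, ybar = sum(t) / n, sum(y) / n
    sxx = sum((ti - tbar) ** 2 for ti in t)
    sxy = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y))
    b1 = sxy / sxx
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    ss_res = sum((yi - ybar - b1 * (ti - tbar)) ** 2 for ti, yi in zip(t, y))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return b1, r2

# A noiseless line is recovered exactly: slope 2 and R^2 = 1.
print(m5_m6([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0)
```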
m7: Curve length (total variation). This measure is given by the formula
\[ \mathrm{m7} = \sum_{i=1}^{N-1}\sqrt{(t_{i+1} - t_i)^2 + (y(t_{i+1}) - y(t_i))^2}. \]
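The curve length is just the sum of the segment lengths of the polygonal graph; a one-line sketch (Python, for illustration only):

```python
import math

def m7_length(t, y):
    """Polygonal curve length of the trajectory's graph: measure m7."""
    return sum(math.hypot(t[i + 1] - t[i], y[i + 1] - y[i]) for i in range(len(t) - 1))

print(m7_length([0.0, 3.0], [0.0, 4.0]))            # 5.0 (a single 3-4-5 segment)
print(m7_length([0.0, 2.0, 5.0], [1.0, 1.0, 1.0]))  # 5.0 (flat trajectory: elapsed time)
```

Note that a perfectly flat trajectory still has positive length, namely the elapsed time \(t_N - t_1\).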
m8: Rate of intersection with the mean. For each \(i=1,\ldots,N-1\), let \(y_0(t_i) = y(t_i) -\mathrm{m3}\) and set
\[ \chi_i=\left\{ \begin{array}{cc} 1 & \text{if $y_0(t_{i})\neq 0$ and $\mathrm{sgn}(y_0(t_{i})\times y_0(t_{j}))=-1$ for $j$ the smallest index with $j>i$ and $y_0(t_j)\neq 0$} \\ 0 & \text{otherwise} \end{array} \right. , \]
\[ \mathrm{m8} = \frac{1}{t_N-t_1}\sum_{i=1}^{N-1}\chi_i. \]
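The indicator \(\chi_i\) counts a crossing whenever the next nonzero centered value has the opposite sign, so zeros sitting exactly on the mean are skipped rather than counted. A short sketch (Python, independent of the package):

```python
def m8_rate(t, y, m3):
    """Mean-crossings per unit time: measure m8 (m3 is the trajectory mean)."""
    y0 = [yi - m3 for yi in y]
    count = 0
    for i in range(len(y0) - 1):
        if y0[i] == 0:
            continue  # chi_i = 0 when y0(t_i) = 0
        nxt = next((v for v in y0[i + 1:] if v != 0), None)  # next nonzero value
        if nxt is not None and y0[i] * nxt < 0:
            count += 1
    return count / (t[-1] - t[0])

# The sign alternates three times over 3 time units: one crossing per unit time.
print(m8_rate([0, 1, 2, 3], [1, -1, 1, -1], 0.0))  # 1.0
```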
m9: Proportion of time spent above the mean. Again, let \(y_0(t_i) =y(t_i)-\mathrm{m3}\) and set
\[ T^+=\frac{t_2 - t_1}{2}\mathbb{I}(y_0(t_1)>0) + \sum_{i=2}^{N-1}\frac{t_{i+1} - t_{i-1}}{2}\mathbb{I}(y_0(t_i)>0) + \frac{t_N - t_{N-1}}{2}\mathbb{I}(y_0(t_N)>0), \]
\[ T^-=\frac{t_2 - t_1}{2}\mathbb{I}(y_0(t_1)<0) + \sum_{i=2}^{N-1}\frac{t_{i+1} -t_{i-1}}{2}\mathbb{I}(y_0(t_i)<0) + \frac{t_N - t_{N-1}}{2}\mathbb{I}(y_0(t_N)<0), \]
\[ \mathrm{m9} = \frac{T^+}{T^- + T^+}. \]
In the event that both the numerator and denominator of m9 are 0, m9 is set to 1.
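In \(T^+\) and \(T^-\), each observation is weighted by half the width of its adjacent interval(s), so the two quantities estimate the time spent strictly above and strictly below the mean. A sketch of the computation (Python, for illustration only):

```python
def m9_above(t, y, m3):
    """Proportion of time spent above the mean: measure m9."""
    n = len(t)
    y0 = [yi - m3 for yi in y]
    # weight of each observation: half the width of its adjacent interval(s)
    w = ([(t[1] - t[0]) / 2]
         + [(t[i + 1] - t[i - 1]) / 2 for i in range(1, n - 1)]
         + [(t[-1] - t[-2]) / 2])
    t_plus = sum(wi for wi, v in zip(w, y0) if v > 0)
    t_minus = sum(wi for wi, v in zip(w, y0) if v < 0)
    return 1.0 if t_plus + t_minus == 0 else t_plus / (t_plus + t_minus)

# Two points above the mean and two below, evenly spaced: 0.5.
print(m9_above([0, 1, 2, 3], [2, 2, -2, -2], 0.0))  # 0.5
# Degenerate case (T+ and T- both 0): defined as 1.0.
print(m9_above([0, 1, 2], [0, 0, 0], 0.0))  # 1.0
```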
m10: Minimum of the first derivative. Measures 10-13 concern \(y'(t)\), the first derivative of \(y(t)\). The trajectory \(y'(t_i)\) is approximated from the data as follows:\[\widehat{y'}(t_i)= \left\{ \begin{array}{cc} \Delta_i^+ & \text{if $i=1$} \\ w_i^-\Delta^-_i + w_i^+\Delta^+_i & \text{if $1<i<N$} \\ \Delta_i^- & \text{if $i=N$} \end{array} \right.\]
where \[\Delta^-_i = \frac{y(t_i)-y(t_{i-1})}{t_i-t_{i-1}},\quad \Delta_i^+=\frac{y(t_{i+1})-y(t_i)}{t_{i+1}-t_i}\]
and where \[ w^-_i = \frac{t_{i+1}-t_i}{t_{i+1} - t_{i-1}},\quad w^+_i = \frac{t_i-t_{i-1}}{t_{i+1}-t_{i-1}}. \]
By definition then,\[\mathrm{m10}=\min_{1\leq i\leq N}\widehat{y'}(t_i).\]
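The scheme above is a one-sided difference at the endpoints and a weighted average of the two one-sided differences at interior points. The following sketch (Python, independent of the package) implements it directly:

```python
def d1_estimate(t, y):
    """Weighted finite-difference first derivative used by measures m10-m13."""
    n = len(t)
    d = []
    for i in range(n):
        if i == 0:
            d.append((y[1] - y[0]) / (t[1] - t[0]))         # Delta_1^+
        elif i == n - 1:
            d.append((y[-1] - y[-2]) / (t[-1] - t[-2]))     # Delta_N^-
        else:
            dm = (y[i] - y[i - 1]) / (t[i] - t[i - 1])      # Delta_i^-
            dp = (y[i + 1] - y[i]) / (t[i + 1] - t[i])      # Delta_i^+
            wm = (t[i + 1] - t[i]) / (t[i + 1] - t[i - 1])  # w_i^-
            wp = (t[i] - t[i - 1]) / (t[i + 1] - t[i - 1])  # w_i^+
            d.append(wm * dm + wp * dp)
    return d

# On a straight line y = 2t the estimate is exact at every point, so m10 = 2.
d1 = d1_estimate([0.0, 1.0, 3.0, 4.0], [0.0, 2.0, 6.0, 8.0])
print(d1)       # [2.0, 2.0, 2.0, 2.0]
print(min(d1))  # m10 = 2.0
```

Note the cross-weighting: the slope over the interval nearer to \(t_i\) receives the larger weight, since \(w_i^-\) is proportional to the width of the interval on the other side.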
m11: Maximum of the first derivative. This is \[\mathrm{m11} = \max_{1\leq i\leq N}\widehat{y'}(t_i),\]where \(\widehat{y'}(t_i)\) is the trajectory defined in the discussion of m10.
m12: Mean of the first derivative. This is \[\mathrm{m12} = \frac{1}{t_N-t_1}\sum_{i=1}^{N-1}\frac{\widehat{y'}(t_i)+\widehat{y'}(t_{i+1})}{2}(t_{i+1}-t_i),\]where \(\widehat{y'}(t_i)\) is the trajectory defined in the discussion of m10.
m13: Standard deviation of the first derivative. This is defined by the formula
\[\mathrm{m13} = \sqrt{\frac{1}{t_N-t_1}\sum_{i=1}^{N-1}\frac{\left(\widehat{y'}(t_i)-\mathrm{m12}\right)^2 + \left(\widehat{y'}(t_{i+1})-\mathrm{m12}\right)^2}{2}(t_{i+1}-t_i)}. \]
m14: Minimum of the second derivative. Measures 14-17 concern \(y''(t)\), the second derivative of \(y(t)\). For this, a trajectory \(\widehat{y''}(t_i)\) is constructed from the trajectory \(\widehat{y'}(t_i)\) in the same way as \(\widehat{y'}(t_i)\) is constructed from \(y(t_i)\) (cf. m10):\[ \widehat{y''}(t_i) = \left\{ \begin{array}{cc} \Delta'^+_{\,i} & \text{if $i=1$} \\ w_i^-\Delta'^-_{\,i} + w_i^+\Delta'^+_{\,i} & \text{if $1<i<N$} \\ \Delta'^-_{\,i} & \text{if $i=N$} \end{array} \right. \]
where \(w_i^{\pm}\) are defined as in the description of m10 and where \[ \Delta'^-_{\,i} =\frac{\widehat{y'}(t_i) - \widehat{y'}(t_{i-1})}{t_i-t_{i-1}},\quad \Delta'^+_{\,i} =\frac{\widehat{y'}(t_{i+1}) - \widehat{y'}(t_{i})}{t_{i+1}-t_{i}}. \]
By definition then,\[\mathrm{m14}=\min_{1\leq i\leq N}\widehat{y''}(t_i).\]
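Since the construction is the same finite-difference scheme applied a second time, it can be sketched by composing the first-derivative estimator with itself (Python, for illustration only):

```python
def d1_estimate(t, y):
    """Weighted finite differences (the scheme of m10)."""
    n = len(t)
    d = [(y[1] - y[0]) / (t[1] - t[0])]                 # forward difference at t_1
    for i in range(1, n - 1):
        dm = (y[i] - y[i - 1]) / (t[i] - t[i - 1])
        dp = (y[i + 1] - y[i]) / (t[i + 1] - t[i])
        wm = (t[i + 1] - t[i]) / (t[i + 1] - t[i - 1])
        wp = (t[i] - t[i - 1]) / (t[i + 1] - t[i - 1])
        d.append(wm * dm + wp * dp)
    d.append((y[-1] - y[-2]) / (t[-1] - t[-2]))         # backward difference at t_N
    return d

def d2_estimate(t, y):
    """Second-derivative trajectory for measures m14-m17: differentiate twice."""
    return d1_estimate(t, d1_estimate(t, y))

# A straight line has zero second derivative everywhere, so m14 = 0.
d2 = d2_estimate([0.0, 1.0, 3.0, 4.0], [0.0, 2.0, 6.0, 8.0])
print(d2)  # [0.0, 0.0, 0.0, 0.0]
```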
m15: Maximum of the second derivative. This is \[\mathrm{m15} = \max_{1\leq i\leq N}\widehat{y''}(t_i),\]where \(\widehat{y''}(t_i)\) is the trajectory defined in the discussion of m14.
m16: Mean of the second derivative. This is \[\mathrm{m16} = \frac{1}{t_N-t_1}\sum_{i=1}^{N-1}\frac{\widehat{y''}(t_i)+\widehat{y''}(t_{i+1})}{2}(t_{i+1}-t_i),\]where \(\widehat{y''}(t_i)\) is the trajectory defined in the discussion of m14.
m17: Standard deviation of the second derivative. This is defined by
\[\mathrm{m17} = \sqrt{\frac{1}{t_N-t_1}\sum_{i=1}^{N-1}\frac{\left(\widehat{y''}(t_i)-\mathrm{m16}\right)^2 + \left(\widehat{y''}(t_{i+1})-\mathrm{m16}\right)^2}{2}(t_{i+1}-t_i)}. \]
m18: Later change/Early change. Given an observation time \(t_m\) with \(1<m<N\) which is to act as the “midpoint” of the trajectory, this is\[\mathrm{m18} = \frac{y(t_N)-y(t_m)}{y(t_m)-y(t_1)}.\]
In the event that both the numerator and denominator of m18 are 0, m18 is set to 1.
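A tiny sketch of m18 (Python, independent of the package; the midpoint index is passed in explicitly):

```python
def m18_ratio(y, m):
    """Later change / early change: measure m18 (m is the 0-based midpoint index)."""
    later = y[-1] - y[m]
    early = y[m] - y[0]
    if later == 0 and early == 0:
        return 1.0  # convention when both changes vanish
    return later / early

print(m18_ratio([0.0, 2.0, 8.0], 1))  # 3.0: rose by 6 after the midpoint vs 2 before
print(m18_ratio([4.0, 4.0, 4.0], 1))  # 1.0: degenerate case
```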
If the cap.outliers argument of the Step1Measures function is set to TRUE, or if a measure contains values that are infinite (caused by a division by 0), the outliers will be capped as follows. In a recent paper published on the arXiv [3], the author proves that for any continuous random variable \(X\) with finite first two moments, \[
P[|X-\mu|>k\sigma]<\sigma M(k)\alpha(k),
\]where \(M(k)\) is the least upper bound of the probability density function on \(\{x \ | \ |x-\mu|>k\sigma\}\) and where \(\alpha(k)\) is the unique real root of the cubic polynomial\[t^3 + 2\pi kt^2 + 2\pi e k^2t - \frac{2\pi e}{\sigma M(k)}.\]Suppose that our data set contains \(n\) trajectories. This gives us a sample \(X_1,\ldots,X_n\) from the distribution of \(X\). Our capping procedure consists of the following steps.
1. After relabeling the observed values of \(X\) so that \(|X_1|\leq |X_2|\leq\ldots \leq |X_n|\), remove the last \(r=\min(1,n/100)\) observations.
2. From the remaining values \(X_1,\ldots, X_{n-r}\), compute estimates \(\hat{\mu}\) and \(\hat{\sigma}\) of the mean and standard deviation as usual.
3. Still using only \(X_1,\ldots, X_{n-r}\), approximate the probability density function of \(X\) using a kernel density estimator (function stats::density), find the value of \(M(k)\) for this PDF and compute \(\alpha(k)\).
4. Using these, identify the smallest value of \(k\) for which \(\hat{\sigma}M(k)\alpha(k)<0.003\). If \(k^*\) denotes this smallest value, replace any \(X_i\) with \(X_i<\hat{\mu}-k^*\hat{\sigma}\) by \(\hat{\mu}-k^*\hat{\sigma}\) and any \(X_i\) with \(X_i>\hat{\mu}+k^*\hat{\sigma}\) by \(\hat{\mu}+k^*\hat{\sigma}\).
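The data-driven choice of \(k^*\) (via \(M(k)\) and \(\alpha(k)\)) is the involved part of the procedure; the final step, however, is plain winsorization at \(\hat{\mu}\pm k\hat{\sigma}\). Here is a simplified sketch of that final step only (Python, not the package's implementation; the trimming and the choice of k are assumed to have been done already):

```python
import statistics

def cap_at_k_sigma(x, k):
    """Cap values outside [mu - k*sigma, mu + k*sigma] at the boundary.

    Simplified sketch of the final capping step: k is supplied by the caller
    rather than derived from the kernel-density bound described above.
    """
    mu = statistics.fmean(x)
    sd = statistics.stdev(x)
    lo, hi = mu - k * sd, mu + k * sd
    return [min(max(v, lo), hi) for v in x]

vals = [1.0, 2.0, 3.0, 2.0, 100.0]
capped = cap_at_k_sigma(vals, 1.0)
# Every capped value now lies within one standard deviation of the original mean.
print(max(capped) <= statistics.fmean(vals) + statistics.stdev(vals))  # True
```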
[1] Leffondré K, Abrahamowicz M, Regeasse A, Hawker GA, Badley EM, McCusker J, Belzile E. Statistical measures were proposed for identifying longitudinal patterns of change in quantitative health indicators. J Clin Epidemiol. 2004 Oct; 57(10):1049-62. doi: 10.1016/j.jclinepi.2004.02.012. PMID: 15528056.
[2] Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Series B Stat Methodol. 2001 Jul; 63(2):411-423.
[3] Nishiyama T. Improved Chebyshev inequality: new probability bounds with known supremum of PDF. arXiv:1808.10770v2.