The mean and standard deviation are the workhorses of multivariate statistics. But they have a fundamental problem: they are not robust. A single outlier can pull the mean far from the center of the data, and the IQR — the standard tool for flagging outliers — has no natural multivariate generalization.
Statistical depth functions solve this. A depth function assigns to every point in \(\mathbb{R}^d\) a value measuring how central that point is with respect to a distribution. The deepest point is the multivariate median — a genuine robust generalization of the univariate median. Points near the periphery of the distribution have depth near zero.
Existing R packages for depth functions (ddalpha, depth, DepthProc) implement these ideas but cap out at \(d = 2\) or \(d = 3\), or are too slow for practical use at moderate dimension. depthR is built from the ground up in C++ to work in arbitrary dimension \(d\), with parallel computation and adaptive stopping rules that keep computation time under control.
depthR implements five depth functions covering the main theoretical families in the literature.
The simplest depth function. Depth is a monotone decreasing function of the Mahalanobis distance from the estimated mean:
\[D(x, F) = \frac{1}{1 + (x - \mu)^\top \Sigma^{-1} (x - \mu)}\]
The deepest point is the sample mean. Mahalanobis depth is included as a baseline — it is fast and closed-form, but assumes elliptical symmetry and has zero breakdown point.
The canonical depth function, defined as the minimum probability mass in any closed halfspace containing \(x\):
\[TD(x, F) = \inf_{u \neq 0} P(u^\top X \leq u^\top x)\]
The Tukey median has breakdown point up to \(1/(d+1)\) and satisfies all four Zuo-Serfling axioms. Exact computation is \(O(n^{d-1})\); depthR uses an adaptive random projection approximation.
#> [1] 0.4 0.0
The probability that a random simplex formed by \(d+1\) draws from \(F\) contains \(x\):
\[SD(x, F) = P(x \in S[X_1, \ldots, X_{d+1}])\]
Simplicial depth has a beautiful geometric interpretation and reduces to the univariate median when \(d = 1\). depthR uses adaptive Monte Carlo with a Bernoulli standard error stopping rule.
#> [1] 0.104 0.000
Based on the Stahel-Donoho outlyingness — the supremum over all directions of the robust univariate Z-score:
\[O(x, F) = \sup_{u \neq 0} \frac{|u^\top x - \text{med}(u^\top F)|} {\text{MAD}(u^\top F)}, \quad PD(x, F) = \frac{1}{1 + O(x, F)}\]
Projection depth uses median and MAD rather than mean and SD, giving it a high breakdown point. It is affine invariant.
#> [1] 0.63968225 0.08674677
Defined as one minus the norm of the mean unit vector pointing from the data toward \(x\):
\[SpD(x, F) = 1 - \left\| E\left[ \frac{x - X}{\|x - X\|} \right] \right\|\]
Spatial depth is the only function in depthR with a closed-form sample estimate — no Monte Carlo needed. This makes it extremely fast even at large \(n\) and \(d\).
All depth functions plug into compute_depth(), which
returns a depth object from which medians, ranks, outliers,
and central regions can be extracted without recomputing depth.
dd <- compute_depth(data, depth_fn = mahalanobis_depth)
dd
#> Depth object
#> Call : compute_depth(data = data, depth_fn = mahalanobis_depth)
#> n : 50 observations
#> d : 4 dimensions
#> Depth fn : x$depth_fn
#> Depth range: [0.0542, 0.7121]The deepest point — a genuine robust estimator of multivariate location.
Rank 1 is the deepest (most central) observation; rank \(n\) is the most outlying.
set.seed(1)
clean <- matrix(rnorm(200), nrow = 100, ncol = 2)
outliers_injected <- matrix(runif(10, min = 4, max = 6), nrow = 5, ncol = 2)
contaminated <- rbind(clean, outliers_injected)
dd_cont <- compute_depth(contaminated, depth_fn = mahalanobis_depth)
out <- outliers(dd_cont, threshold = 0.05)
cat("Injected outliers are rows 101-105\n")
#> Injected outliers are rows 101-105
cat("Detected outlier indices:\n")
#> Detected outlier indices:
print(sort(out$indices))
#> [1] 56 101 102 103 104 105The depth-depth plot is the multivariate analog of the QQ-plot.
set.seed(2)
x_same <- matrix(rnorm(200), nrow = 100, ncol = 2)
y_same <- matrix(rnorm(200), nrow = 100, ncol = 2)
dd_plot(x_same, y_same,
depth_fn = mahalanobis_depth,
main = "DD-Plot: same distribution")y_shift <- matrix(rnorm(200, mean = 2), nrow = 100, ncol = 2)
dd_plot(x_same, y_shift,
depth_fn = mahalanobis_depth,
main = "DD-Plot: location shift of 2")depthR works in high dimension. Here we demonstrate on \(d = 10\).
set.seed(3)
n <- 100
d <- 10
data_hd <- matrix(rnorm(n * d), nrow = n, ncol = d)
x_hd <- rbind(colMeans(data_hd), colMeans(data_hd) + 4)
cat("Dimension d =", d, ", n =", n, "\n")
#> Dimension d = 10 , n = 100cat("Mahalanobis depth:\n")
#> Mahalanobis depth:
print(round(mahalanobis_depth(x_hd, data_hd), 4))
#> [1] 1.0000 0.0044cat("\nTukey depth:\n")
print(round(tukey_depth(x_hd, data_hd,
batch_size = 50, min_batches = 3, patience = 2), 4))#>
#> Tukey depth:
#> [1] 0.42 0.00
cat("\nSpatial depth:\n")
#>
#> Spatial depth:
print(round(spatial_depth(x_hd, data_hd), 4))
#> [1] 0.9851 0.0272Liu, R. Y. (1990). On a notion of data depth based on random simplices. Annals of Statistics, 18(1), 405–414.
Vardi, Y. & Zhang, C.-H. (2000). The multivariate L1-median and associated data depth. PNAS, 97(4), 1423–1426.
Zuo, Y. & Serfling, R. (2000). General notions of statistical depth function. Annals of Statistics, 28(2), 461–482.
Serfling, R. (2006). Depth functions in nonparametric multivariate inference. DIMACS Series in Discrete Mathematics, 72, 1–16.