Introduction to depthR

Jason Parker

2026-06-22

What is Statistical Depth?

The mean and standard deviation are the workhorses of multivariate statistics. But they have a fundamental problem: they are not robust. A single outlier can pull the mean far from the center of the data, and the IQR — the standard tool for flagging outliers — has no natural multivariate generalization.

Statistical depth functions solve this. A depth function assigns to every point in \(\mathbb{R}^d\) a value measuring how central that point is with respect to a distribution. The deepest point is the multivariate median — a genuine robust generalization of the univariate median. Points near the periphery of the distribution have depth near zero.

Existing R packages for depth functions (ddalpha, depth, DepthProc) implement these ideas but cap out at \(d = 2\) or \(d = 3\), or are too slow for practical use at moderate dimension. depthR is built from the ground up in C++ to work in arbitrary dimension \(d\), with parallel computation and adaptive stopping rules that keep computation time under control.


The Five Depth Functions

depthR implements five depth functions covering the main theoretical families in the literature.

Mahalanobis Depth

The simplest depth function. Depth is a monotone decreasing function of the Mahalanobis distance from the estimated mean:

\[D(x, F) = \frac{1}{1 + (x - \mu)^\top \Sigma^{-1} (x - \mu)}\]

The deepest point is the sample mean. Mahalanobis depth is included as a baseline — it is fast and closed-form, but assumes elliptical symmetry and has zero breakdown point.

data <- matrix(rnorm(200), nrow = 50, ncol = 4)
x    <- rbind(colMeans(data),        # center
              colMeans(data) + 3)    # outlier direction

mahalanobis_depth(x, data)
#> [1] 1.00000000 0.02281656

Tukey (Halfspace) Depth

The canonical depth function, defined as the minimum probability mass in any closed halfspace containing \(x\):

\[TD(x, F) = \inf_{u \neq 0} P(u^\top X \leq u^\top x)\]

The Tukey median has breakdown point up to \(1/(d+1)\) and satisfies all four Zuo-Serfling axioms. Exact computation is \(O(n^{d-1})\); depthR uses an adaptive random projection approximation.

tukey_depth(x, data, batch_size = 50, min_batches = 3, patience = 2)
#> [1] 0.4 0.0

Liu Simplicial Depth

The probability that a random simplex formed by \(d+1\) draws from \(F\) contains \(x\):

\[SD(x, F) = P(x \in S[X_1, \ldots, X_{d+1}])\]

Simplicial depth has a beautiful geometric interpretation and reduces to the univariate median when \(d = 1\). depthR uses adaptive Monte Carlo with a Bernoulli standard error stopping rule.

simplicial_depth(x, data, batch_size = 50, min_batches = 3, max_batches = 5)
#> [1] 0.104 0.000

Projection Depth

Based on the Stahel-Donoho outlyingness — the supremum over all directions of the robust univariate Z-score:

\[O(x, F) = \sup_{u \neq 0} \frac{|u^\top x - \text{med}(u^\top F)|} {\text{MAD}(u^\top F)}, \quad PD(x, F) = \frac{1}{1 + O(x, F)}\]

Projection depth uses median and MAD rather than mean and SD, giving it a high breakdown point. It is affine invariant.

projection_depth(x, data, batch_size = 50, min_batches = 3, patience = 2)
#> [1] 0.63968225 0.08674677

Spatial Depth

Defined as one minus the norm of the mean unit vector pointing from the data toward \(x\):

\[SpD(x, F) = 1 - \left\| E\left[ \frac{x - X}{\|x - X\|} \right] \right\|\]

Spatial depth is the only function in depthR with a closed-form sample estimate — no Monte Carlo needed. This makes it extremely fast even at large \(n\) and \(d\).

spatial_depth(x, data)
#> [1] 0.94184375 0.03753945

The depth Object

All depth functions plug into compute_depth(), which returns a depth object from which medians, ranks, outliers, and central regions can be extracted without recomputing depth.

dd <- compute_depth(data, depth_fn = mahalanobis_depth)
dd
#> Depth object
#>   Call      : compute_depth(data = data, depth_fn = mahalanobis_depth) 
#>   n         : 50 observations
#>   d         : 4 dimensions
#>   Depth fn  : x$depth_fn 
#>   Depth range: [0.0542, 0.7121]

Depth-Based Median

The deepest point — a genuine robust estimator of multivariate location.

m <- median(dd)
cat("Median index:", m$index, "\n")
#> Median index: 26
print(round(m$point, 3))
#> [1] -0.430  0.581 -0.428 -0.151
cat("Depth:", round(m$depth, 4), "\n")
#> Depth: 0.7121

Depth-Based Ranks

Rank 1 is the deepest (most central) observation; rank \(n\) is the most outlying.

r <- rank(dd)
cat("5 deepest observations:\n")
#> 5 deepest observations:
print(which(r <= 5))
#> [1]  5  8 26 32 33

Central Region

The inner \(1 - \alpha\) fraction of the data by depth.

cr <- central_region(dd, alpha = 0.50)
cat("Inner 50% contains", length(cr$indices), "observations\n")
#> Inner 50% contains 25 observations

Outlier Detection

set.seed(1)
clean             <- matrix(rnorm(200), nrow = 100, ncol = 2)
outliers_injected <- matrix(runif(10, min = 4, max = 6), nrow = 5, ncol = 2)
contaminated      <- rbind(clean, outliers_injected)

dd_cont <- compute_depth(contaminated, depth_fn = mahalanobis_depth)
out     <- outliers(dd_cont, threshold = 0.05)

cat("Injected outliers are rows 101-105\n")
#> Injected outliers are rows 101-105
cat("Detected outlier indices:\n")
#> Detected outlier indices:
print(sort(out$indices))
#> [1]  56 101 102 103 104 105

Visualising Outliers

plot(dd_cont, outlier_threshold = 0.05,
     main = "Mahalanobis depth — injected outliers flagged in red")


The DD-Plot

The depth-depth plot is the multivariate analog of the QQ-plot.

set.seed(2)
x_same <- matrix(rnorm(200), nrow = 100, ncol = 2)
y_same <- matrix(rnorm(200), nrow = 100, ncol = 2)

dd_plot(x_same, y_same,
        depth_fn = mahalanobis_depth,
        main     = "DD-Plot: same distribution")

y_shift <- matrix(rnorm(200, mean = 2), nrow = 100, ncol = 2)

dd_plot(x_same, y_shift,
        depth_fn = mahalanobis_depth,
        main     = "DD-Plot: location shift of 2")


High-Dimensional Performance

depthR works in high dimension. Here we demonstrate on \(d = 10\).

set.seed(3)
n       <- 100
d       <- 10
data_hd <- matrix(rnorm(n * d), nrow = n, ncol = d)
x_hd    <- rbind(colMeans(data_hd), colMeans(data_hd) + 4)
cat("Dimension d =", d, ", n =", n, "\n")
#> Dimension d = 10 , n = 100
cat("Mahalanobis depth:\n")
#> Mahalanobis depth:
print(round(mahalanobis_depth(x_hd, data_hd), 4))
#> [1] 1.0000 0.0044
cat("\nTukey depth:\n")
print(round(tukey_depth(x_hd, data_hd,
            batch_size = 50, min_batches = 3, patience = 2), 4))
#> 
#> Tukey depth:
#> [1] 0.42 0.00
cat("\nSpatial depth:\n")
#> 
#> Spatial depth:
print(round(spatial_depth(x_hd, data_hd), 4))
#> [1] 0.9851 0.0272

References