A principal curve is a smooth curve passing through the middle of a multidimensional dataset. This package is an R/C++ reimplementation of the S/Fortran code provided by Trevor Hastie, with multiple performance tweaks.

Example

Deriving a principal curve is an iterative process. This is what it looks like for a two-dimensional toy dataset:

Algorithm

Pseudocode for the princurve algorithm is shown below. The individual steps are explained in more detail in the following subsections.

# initialisation
s = principal_component(x)
x_proj = project(x, s)
lambda = arc_length(x_proj)

# iterative process
for (it = 1..max_iter) {
  s = smooth(lambda, x)
  s' = approximate(lambda, s, approx_points = 100)
  x_proj = project(x, s')
  lambda = arc_length(x_proj)
}

Initialisation

The principal curve s is initialised (at iteration 0) by calculating the first principal component of x. All points in x are projected orthogonally onto s, and the arc-length lambda of each projection, measured from the start of the curve, is calculated.
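The initialisation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the package's code: it assumes two-dimensional data, and the helper names first_principal_component and initialise are hypothetical.

```python
import math

def first_principal_component(x):
    """Return (mean, unit direction) of the first principal component of 2-D points."""
    n = len(x)
    mx = sum(p[0] for p in x) / n
    my = sum(p[1] for p in x) / n
    # Entries of the 2x2 scatter matrix (the 1/n factor does not change the direction).
    sxx = sum((p[0] - mx) ** 2 for p in x)
    syy = sum((p[1] - my) ** 2 for p in x)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in x)
    # Closed-form angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def initialise(x):
    """Project the points onto the first principal component and compute lambda."""
    (mx, my), (dx, dy) = first_principal_component(x)
    # Signed position of each orthogonal projection along the line.
    t = [(p[0] - mx) * dx + (p[1] - my) * dy for p in x]
    t0 = min(t)
    lam = [ti - t0 for ti in t]  # arc length measured from the start of the line
    x_proj = [(mx + ti * dx, my + ti * dy) for ti in t]
    return x_proj, lam

# Hypothetical toy data lying roughly along a diagonal.
x = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
x_proj, lam = initialise(x)
```

Because the initial curve is a straight line, the arc length of a projection is simply its position along the line, shifted so that the first projection sits at zero.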

Iteration 1

Each iteration consists of three steps: smoothing, approximation, and projection.

Smoothing: calculate new curve

During the smoothing step, a new curve is computed by smoothing each dimension in x w.r.t. the arc-length lambda calculated for the previous curve.
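To make the smoothing step concrete, the sketch below smooths each coordinate of x against lambda. The Gaussian kernel smoother is a simple stand-in chosen for illustration and is not claimed to be the smoother the package uses; kernel_smooth, smooth_curve, and bandwidth are hypothetical names.

```python
import math

def kernel_smooth(lam, values, bandwidth):
    """Nadaraya-Watson estimate of `values` as a function of `lam`, evaluated at each lam[i]."""
    smoothed = []
    for li in lam:
        # Gaussian weights: points with a nearby lambda contribute most.
        weights = [math.exp(-0.5 * ((li - lj) / bandwidth) ** 2) for lj in lam]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed

def smooth_curve(lam, x, bandwidth=1.0):
    """Smooth every coordinate of x against lambda; returns one curve point per data point."""
    dims = len(x[0])
    columns = [kernel_smooth(lam, [p[d] for p in x], bandwidth) for d in range(dims)]
    return [tuple(col[i] for col in columns) for i in range(len(x))]

# Hypothetical toy input: lambda values and the matching noisy points.
lam = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [(0.0, 0.0), (1.0, 1.2), (2.0, 0.8), (3.0, 1.1), (4.0, 0.0)]
s = smooth_curve(lam, x, bandwidth=1.0)
```

Note that the result has as many curve points as there are data points, which is what motivates the approximation step.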

Approximation: simplify curve

In the next step (projection), each of the n points in x is compared to each segment of the curve s. After the smoothing step, the curve consists of n points, so the projection step would have quadratic complexity in n. To make this step scale closer to linearly, the approx_points parameter (for example, approx_points = 100) can be used to first approximate the curve by one with 100 points.
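One way to carry out such an approximation is to resample the smoothed curve at evenly spaced lambda values using linear interpolation. The sketch below assumes that approach; the function name approximate mirrors the pseudocode, and the details are illustrative rather than taken from the package.

```python
from bisect import bisect_right

def approximate(lam, s, approx_points=100):
    """Resample curve points s (one per lambda value) at evenly spaced lambda values.

    Assumes at least two curve points and approx_points >= 2.
    """
    order = sorted(range(len(lam)), key=lambda i: lam[i])
    lam_sorted = [lam[i] for i in order]
    s_sorted = [s[i] for i in order]
    lo, hi = lam_sorted[0], lam_sorted[-1]
    out = []
    for k in range(approx_points):
        target = lo + (hi - lo) * k / (approx_points - 1)
        # Index of the segment [lam_sorted[j], lam_sorted[j + 1]] that contains target.
        j = min(max(bisect_right(lam_sorted, target) - 1, 0), len(lam_sorted) - 2)
        span = lam_sorted[j + 1] - lam_sorted[j]
        w = 0.0 if span == 0.0 else (target - lam_sorted[j]) / span
        # Linear interpolation between the two neighbouring curve points.
        out.append(tuple(a + w * (b - a) for a, b in zip(s_sorted[j], s_sorted[j + 1])))
    return out

# Hypothetical curve with 1000 points on a parabola, reduced to 100 points.
lam = [i / 999 for i in range(1000)]
s = [(l, l * l) for l in lam]
s_approx = approximate(lam, s, approx_points=100)
```

The resampled curve has a fixed number of segments regardless of n, so the cost of the following projection step grows with n rather than with n squared.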

Projection: calculate new lambda

The projection step is the same as before: all the points are orthogonally projected onto the new curve, and the arc-length lambda is recalculated for the new projections.
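For a curve represented as a sequence of points joined by straight segments, the projection step can be sketched as follows. This is an illustrative implementation under that assumption; project_point and project are hypothetical helpers. A point whose orthogonal projection falls outside a segment is clamped to the nearest segment end.

```python
import math

def project_point(p, curve):
    """Return (closest point on the polyline, arc length to it from the curve start)."""
    best = (math.inf, None, 0.0)
    travelled = 0.0  # total length of the segments before the current one
    for a, b in zip(curve, curve[1:]):
        d = [bk - ak for ak, bk in zip(a, b)]
        seg_len2 = sum(dk * dk for dk in d)
        # Position of the orthogonal projection along the segment, clamped to [0, 1].
        if seg_len2 == 0.0:
            t = 0.0
        else:
            t = sum((pk - ak) * dk for pk, ak, dk in zip(p, a, d)) / seg_len2
            t = min(1.0, max(0.0, t))
        q = tuple(ak + t * dk for ak, dk in zip(a, d))
        dist2 = sum((pk - qk) ** 2 for pk, qk in zip(p, q))
        seg_len = math.sqrt(seg_len2)
        if dist2 < best[0]:
            best = (dist2, q, travelled + t * seg_len)
        travelled += seg_len
    return best[1], best[2]

def project(x, curve):
    """Project every point in x; returns (projected points, lambda values)."""
    projected = [project_point(p, curve) for p in x]
    return [q for q, _ in projected], [lam for _, lam in projected]

# Hypothetical L-shaped curve and three points to project onto it.
curve = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
x = [(0.5, 0.3), (1.2, 0.5), (-1.0, 0.0)]
x_proj, lam = project(x, curve)
```

Every point is compared against every segment, so the cost is proportional to the number of points times the number of segments.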

This process is repeated until convergence or until a predefined number of iterations has passed.
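A stopping rule of this kind can be sketched as below. The criterion shown, a relative change in the total squared distance between the points and their projections, is an assumption made for illustration, as are the names has_converged, iterate, thresh, and the default values.

```python
def has_converged(dist_old, dist_new, thresh=0.001):
    """True when the relative change in total squared distance is at most thresh."""
    return abs(dist_old - dist_new) <= thresh * dist_old

def iterate(step, dist_init, thresh=0.001, max_iter=10):
    """Run step(it) until convergence or max_iter; returns (iterations used, converged?).

    `step` is a hypothetical callable that performs one smoothing, approximation,
    and projection pass and returns the new total squared distance.
    """
    dist_old = dist_init
    for it in range(1, max_iter + 1):
        dist_new = step(it)
        if has_converged(dist_old, dist_new, thresh):
            return it, True
        dist_old = dist_new
    return max_iter, False

# Made-up distances standing in for a real fit, used only to exercise the loop.
fake = [4.0, 3.5, 3.499, 3.499]
step = lambda it: fake[min(it, len(fake)) - 1]
result = iterate(step, dist_init=10.0)
capped = iterate(step, dist_init=10.0, max_iter=2)
```

With these made-up values the loop stops at the third iteration, and it reports non-convergence when capped at two iterations.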

Iteration 2

For clarity's sake, the smoothing, approximation, and projection steps are also shown for iteration 2.

Smoothing: calculate new curve

During the smoothing step, a new curve is computed by smoothing each dimension in x w.r.t. the arc-length lambda calculated for the previous curve.

Approximation: simplify curve

The curve is simplified in order to make the projection step easier.

Projection: calculate new lambda

All the points are orthogonally projected onto the new curve, and the arc-length lambda is recalculated for the new projections.

Timing comparison

princurve 2.1 contains some major optimisations that take effect when the approx_points parameter is used. This is showcased on a toy example in which the number of points was varied between \(10^2\) and \(10^6\).

We can see that princurve 2.1 scales quasi-linearly w.r.t. the number of rows in the dataset, whereas princurve 1.1 scales quadratically. This is due to the approximation step added between the smoothing and projection steps.
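The difference in scaling can be illustrated by counting point-to-segment comparisons in the projection step. The snippet below is a back-of-the-envelope count, not a timing measurement; it ignores the cost of smoothing, and segment_comparisons is a hypothetical helper.

```python
def segment_comparisons(n, approx_points=None):
    """Number of point-to-segment comparisons in one projection step.

    Without approximation the curve has n points (n - 1 segments);
    with approximation it has at most approx_points points.
    """
    curve_points = n if approx_points is None else min(n, approx_points)
    return n * (curve_points - 1)

for n in (10**2, 10**4, 10**6):
    full = segment_comparisons(n)
    approx = segment_comparisons(n, approx_points=100)
    print(n, full, approx)
```

For n = 10^6 this gives 999,999,000,000 comparisons without approximation against 99,000,000 with approx_points = 100, which is why the approximated variant grows linearly in n.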