
TDAvec vignette

Persistence diagrams are a fundamental tool in topological data analysis (TDA) that summarize the multi-scale topological features (such as connected components, loops, and voids) of a dataset. They represent these features as points in a 2D plane, where each point corresponds to the birth and death of a feature across different scales.

Since persistence diagrams are multisets of points in a 2D plane, their unordered structure makes them challenging to use directly in machine learning models. By vectorizing them (i.e., transforming them into fixed-length vectors), conventional statistical and machine learning techniques can be applied while preserving topological information. The TDAvec R package is specifically designed for this task, offering 13 vectorization methods commonly used in TDA. These methods fall into three broad categories: vector summaries of univariate summary functions (such as the Betti curve and the persistence landscape), algebraic and statistical summaries, and vector summaries of bivariate summary functions (such as the persistence surface and the persistence block).

For improved computational efficiency, all code behind the vector summaries of TDAvec is written in C++ using the Rcpp and RcppArmadillo packages. In this vignette, we illustrate the basic usage of the package using simple examples.

Let’s first load the required libraries.

library(TDAvec)
library(TDAstats) # to compute persistence diagrams

Now, we generate a 2D point cloud of size 100 sampled uniformly from a unit circle with added Gaussian noise:

# the number of points to sample
N <- 100 
# set a random seed for reproducibility
set.seed(123) 
# sample N points uniformly from the unit circle and add Gaussian noise
theta <- runif(N, min = 0, max = 2 * pi)
X <- cbind(cos(theta), sin(theta)) + rnorm(2 * N, mean = 0, sd = 0.2)
# plot the point cloud
plot(X,pch = 20,asp = 1,xlab = 'x',ylab = 'y')

Next, we use the TDAstats package to compute the persistence diagram (PD) of a Vietoris-Rips filtration built on top of the point cloud \(X\).

D <- calculate_homology(X,dim=1,threshold=2)
sum(D[,1]==0) # number of connected components
#> [1] 99
sum(D[,1]==1) # number of loops
#> [1] 13
sum(D[,1]==2) # number of voids
#> [1] 0

In the calculate_homology() function, dim is the maximum homological dimension of the topological features to be computed (connected components if dim=0; connected components and loops if dim=1; connected components, loops and voids if dim=2, etc.). threshold is the maximum value of the scale parameter of the filtration (which we set equal to 2 since the points are sampled from a circle with diameter 2).
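
For example, to also search for voids, dim can be raised to 2 at the cost of extra computation (a sketch, not run here):

# compute features up to dimension 2 (connected components, loops, and voids)
D2 <- calculate_homology(X, dim = 2, threshold = 2)
table(D2[,1]) # number of features in each homological dimension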

The persistence diagram \(D\) contains 99 connected components (the point cloud size minus 1; TDAstats drops the connected component with infinite death value), 13 loops (one with a long life span, the rest short-lived) and 0 voids, along with their birth and death values. To plot the diagram, we can use the plot_persist() function.

plot_persist(D)

In the plot, the solid dots and triangles represent connected components and loops respectively.

Let’s compute a vector summary of one of the simplest summary functions, the Betti curve, for homological dimension \(H_1\).

# sequence of scale values to be used for vectorization
scaleSeq = seq(0,1.5,length.out=16) 
# vectorize the Betti curve for homological dimension H_1
computeBettiCurve(D,homDim=1,scaleSeq)
#>  [1] 0.0000000 0.5684067 1.6029493 1.1455386 1.0560370 1.0000000 1.0000000
#>  [8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6921215 0.0000000
#> [15] 0.0000000

By default, vectorization in this case is performed by computing the average values of the Betti curve over consecutive intervals defined by an increasing sequence of scale points (evaluate="intervals"). This vectorization method is adopted as the default for all univariate summary functions that are easy to integrate. More specifically, if \(f\) is a (univariate) summary function and \(t_1,t_2,\ldots,t_n\) are increasing scale values, we discretize \(f\) as:

\[\begin{equation} \Big(\frac{1}{\Delta t_1}\int_{t_1}^{t_2}f(t)dt,\frac{1}{\Delta t_2}\int_{t_2}^{t_3}f(t)dt,\ldots,\frac{1}{\Delta t_{n-1}}\int_{t_{n-1}}^{t_n}f(t)dt\Big)\in\mathbb{R}^{n-1}, \end{equation}\]

where \(\Delta t_k=t_{k+1}-t_k\), \(k=1,\ldots,n-1\).
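
As an illustration of this discretization (not part of TDAvec), the following sketch applies the same interval averaging to a toy summary function using base R numerical integration:

# illustration only: average a toy summary function f over consecutive intervals
f <- function(t) exp(-t)                  # a stand-in univariate summary function
tGrid <- seq(0, 1.5, length.out = 16)     # increasing scale values t_1, ..., t_n
avgF <- sapply(seq_len(length(tGrid) - 1), function(k)
  integrate(f, tGrid[k], tGrid[k + 1])$value / (tGrid[k + 1] - tGrid[k]))
length(avgF)  # n - 1 = 15 interval averages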

Alternatively, by setting the evaluate argument to "points", one can vectorize the Betti curve and all the other univariate summary functions by evaluating them at each scale point and arranging the values into a vector:

\[\begin{equation} (f(t_1),f(t_2),\ldots,f(t_n))\in\mathbb{R}^{n}. \end{equation}\]
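
For example, assuming the evaluate argument in the installed version of TDAvec accepts "points" as described above, the Betti curve evaluated at the 16 scale points would be obtained as:

# evaluate the Betti curve at the scale points themselves (vector of length 16)
computeBettiCurve(D, homDim = 1, scaleSeq, evaluate = "points")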

As a note, the computePersistenceLandscape() function supports both classic and generalized persistence landscape functions. For example:

# classic
computePersistenceLandscape(D,homDim=1,scaleSeq,k=3) # k = the number of persistence landscape functions to consider (default is 1)
#>              [,1]       [,2] [,3]
#>  [1,] 0.000000000 0.00000000    0
#>  [2,] 0.000000000 0.00000000    0
#>  [3,] 0.014217630 0.00000000    0
#>  [4,] 0.016566914 0.01077521    0
#>  [5,] 0.000931983 0.00000000    0
#>  [6,] 0.091114536 0.00000000    0
#>  [7,] 0.191114536 0.00000000    0
#>  [8,] 0.291114536 0.00000000    0
#>  [9,] 0.391114536 0.00000000    0
#> [10,] 0.369212150 0.00000000    0
#> [11,] 0.269212150 0.00000000    0
#> [12,] 0.169212150 0.00000000    0
#> [13,] 0.069212150 0.00000000    0
#> [14,] 0.000000000 0.00000000    0
#> [15,] 0.000000000 0.00000000    0
#> [16,] 0.000000000 0.00000000    0

# generalized
computePersistenceLandscape(D,homDim=1,scaleSeq,k=3,generalized=TRUE,kernel = "epanechnikov",h=0.2) # h = bandwidth for the kernel function
#>             [,1]         [,2]        [,3]
#>  [1,] 0.03443388 1.830831e-02 0.011106582
#>  [2,] 0.04308709 2.187928e-02 0.020820527
#>  [3,] 0.04701315 2.925303e-02 0.021201592
#>  [4,] 0.04621208 3.322784e-02 0.019451506
#>  [5,] 0.04068387 3.380371e-02 0.016559979
#>  [6,] 0.18291816 3.098063e-02 0.030428519
#>  [7,] 0.30725636 2.475862e-02 0.015446032
#>  [8,] 0.38857823 1.513766e-02 0.004707859
#>  [9,] 0.42688376 2.693422e-03 0.002117754
#> [10,] 0.42217296 1.126474e-06 0.000000000
#> [11,] 0.37444582 0.000000e+00 0.000000000
#> [12,] 0.28370235 0.000000e+00 0.000000000
#> [13,] 0.14994254 0.000000e+00 0.000000000
#> [14,] 0.00000000 0.000000e+00 0.000000000
#> [15,] 0.00000000 0.000000e+00 0.000000000
#> [16,] 0.00000000 0.000000e+00 0.000000000

Computing algebraic and statistical vector summaries follows a similar approach. For example:

computeAlgebraicFunctions(D,homDim=0)  
#>          f1          f2          f3          f4 
#> 0.000000000 2.554431476 0.000000000 0.001950765

returns four algebraic functions based on the birth and death values (refer to the help documentation for more details).
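
A statistical summary can be obtained in the same spirit; the sketch below assumes the package provides a computeStats() function with this interface (check the package index for the exact name and arguments):

# assumed interface: descriptive statistics of the H_0 births, deaths, and lifespans
computeStats(D, homDim = 0)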

The persistence surface and the persistence block are bivariate summary functions, and to vectorize them we first need to switch from birth-death to birth-persistence coordinates:

D[,3] <- D[,3] - D[,2] 
colnames(D)[3] <- "Persistence"

Below, we compute the persistence image, which is a vector summary of the persistence surface:

# Persistence Image (PI)
resB <- 5 # resolution (or grid size) along the birth axis
resP <- 5 # resolution (or grid size) along the persistence axis 
# find min and max persistence values for dimension H_0
minPH0 <- min(D[D[,1]==0,3]); maxPH0 <- max(D[D[,1]==0,3])
# construct one-dimensional grid of scale values
ySeqH0 <- seq(minPH0,maxPH0,length.out=resP+1)
# default way of selecting the standard deviation sigma of the Gaussians on top of each point of the diagram
sigma <- 0.5*(maxPH0-minPH0)/resP 
# compute PI for homological dimension H_0
computePersistenceImage(D,homDim=0,xSeq=NA,ySeqH0,sigma)
#> [1]  4.668020 10.986339 10.856586  7.802121  3.695877

Since the \(H_0\) features all have the birth value of zero in this case, a one-dimensional grid of scale values is used for vectorization.
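
This can be checked directly on the diagram:

# for the Vietoris-Rips filtration of a point cloud, all H_0 features are born at scale 0
all(D[D[,1] == 0, 2] == 0)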

For homological dimension \(H_1\), the birth values are not all the same and therefore the vectorization is performed over a two-dimensional grid.

# Persistence Image (PI)
# find min & max birth and persistence values for dimension H_1
minBH1 <- min(D[D[,1]==1,2]); maxBH1 <- max(D[D[,1]==1,2])
minPH1 <- min(D[D[,1]==1,3]); maxPH1 <- max(D[D[,1]==1,3])
xSeqH1 <- seq(minBH1,maxBH1,length.out=resB+1)
ySeqH1 <- seq(minPH1,maxPH1,length.out=resP+1)
sigma <- 0.5*(maxPH1-minPH1)/resP
# compute PI for homological dimension H_1
computePersistenceImage(D,homDim=1,xSeqH1,ySeqH1,sigma) 
#>  [1] 4.371601e-02 4.955567e-02 4.512378e-02 3.299957e-02 1.864442e-02
#>  [6] 7.527654e-03 7.760847e-03 6.305986e-03 4.250972e-03 2.278547e-03
#> [11] 6.038006e-05 5.841014e-05 4.427399e-05 3.094726e-05 2.021064e-05
#> [16] 1.834675e-04 8.921009e-04 2.714215e-03 5.171823e-03 6.175378e-03
#> [21] 3.853840e-03 1.874023e-02 5.701774e-02 1.086451e-01 1.297270e-01

The code for computePersistenceImage() is adapted from the pers.image() function (available in the kernelTDA package) with some modifications. For example, pers.image() uses a one-dimensional grid for homological dimension \(H_0\) regardless of the filtration type. In contrast, computePersistenceImage() uses a one-dimensional grid only when, in addition, the birth values are all the same (which may not hold for some filtration types). Moreover, pers.image() vectorizes over a square grid (e.g., 10 by 10), whereas computePersistenceImage() is not limited to such a grid and can compute vector summaries over a rectangular grid (e.g., 10 by 20).
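
For instance, a rectangular 10-by-20 grid for homological dimension \(H_1\) can be set up by reusing the ranges computed above (a sketch):

# rectangular grid: 10 cells along the birth axis, 20 along the persistence axis
xSeqH1rect <- seq(minBH1, maxBH1, length.out = 10 + 1)
ySeqH1rect <- seq(minPH1, maxPH1, length.out = 20 + 1)
computePersistenceImage(D, homDim = 1, xSeqH1rect, ySeqH1rect, sigma)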

References

  1. Chazal, F., & Michel, B. (2021). An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists. Frontiers in Artificial Intelligence, 108.

  2. Chung, Y. M., & Lawson, A. (2022). Persistence curves: A canonical framework for summarizing persistence diagrams. Advances in Computational Mathematics, 48(1), 1-42.

  3. Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16(1), 77-102.

  4. Berry, E., Chen, Y. C., Cisewski-Kehe, J., & Fasy, B. T. (2020). Functional summaries of persistence diagrams. Journal of Applied and Computational Topology, 4(2), 211-262.

  5. Adcock, A., Carlsson, E., & Carlsson, G. (2016). The ring of algebraic functions on persistence bar codes. Homology, Homotopy and Applications, 18(1), 381-402.

  6. Ali, D., Asaad, A., Jimenez, M. J., Nanda, V., Paluzo-Hidalgo, E., & Soriano-Trigueros, M. (2023). A survey of vectorization methods in topological data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  7. Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., … & Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18.
