The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
Persistence diagrams are a fundamental tool in topological data analysis (TDA) that summarize the multi-scale topological features (such as connected components, loops, and voids) of a dataset. They represent these features as points in a 2D plane, where each point corresponds to the birth and death of a feature across different scales.
Since persistence diagrams are multisets of points in a 2D plane,
their unordered structure makes them challenging to use directly in
machine learning models. By vectorizing them (i.e., transforming into
fixed-length vectors), conventional statistical and machine learning
techniques can be applied while preserving topological information. The
TDAvec
R package is specifically designed for this task,
offering 13 vectorization methods commonly used in TDA. These methods
are divided into three broad categories:
For improved computational efficiency, all code behind the vector
summaries of TDAvec
is written in C++
using
the Rcpp
and RcppArmadillo
packages. In this
vignette, we illustrate the basic usage of the package using simple
examples.
Let’s first load the required libraries.
Now, we generate a 2D point cloud of size 100 sampled uniformly from a unit circle with added Gaussian noise:
# the number of points to sample
N <- 100
# set a random seed for reproducibility
set.seed(123)
# sample N points uniformly from the unit circle and add Gaussian noise
theta <- runif(N, min = 0, max = 2 * pi)
X <- cbind(cos(theta), sin(theta)) + rnorm(2 * N, mean = 0, sd = 0.2)
# plot the point cloud
plot(X,pch = 20,asp = 1,xlab = 'x',ylab = 'y')
Next, we use the TDAstats
package to compute the
persistence diagram (PD) of a Vietoris-Rips filtration built on top of
the point cloud \(X\).
D <- calculate_homology(X,dim=1,threshold=2)
sum(D[,1]==0) # number of connected components
#> [1] 99
sum(D[,1]==1) # number of loops
#> [1] 13
sum(D[,1]==2) # number of voids
#> [1] 0
In the calculate_homology()
function, dim
is the maximum homological dimension of the topological features to be
computed (connected components if dim=0
; connected
components and loops if dim=1
; connected components, loops
and voids if dim=2
, etc.). threshold
is the
maximum value of the scale parameter of the filtration (which we set
equal to 2 since the points are sampled from a circle with diameter
2).
The persistence diagram \(D\) has 99
connected components (the point cloud size - 1; TDAstats
drops the connected component with infinite death value), 13 loops (one
with long life-span, the rest are short-lived) and 0 voids along with
their birth
and death
values. To plot the
diagram, we can use the plot_persisit()
function.
In the plot, the solid dots and triangles represent connected components and loops respectively.
Let’s compute a vector summary of one of the simplest summary functions, the Betti curve, for homological dimension \(H_1\).
# sequence of scale values to be used for vectorization
scaleSeq = seq(0,1.5,length.out=16)
# vectorize the Betti curve for homological dimension H_1
computeBettiCurve(D,homDim=1,scaleSeq)
#> [1] 0.0000000 0.5684067 1.6029493 1.1455386 1.0560370 1.0000000 1.0000000
#> [8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6921215 0.0000000
#> [15] 0.0000000
By default, vectorization in this case is performed by computing the
average values of the Betti curve over consecutive intervals defined by
an increasing sequence of scale points
(evaluate
=“intervals”). This vectorization method is
adopted as the default for all univariate summary functions that are
easy to integrate. More specifically, if \(f\) is a (univariate) summary function and
\(t_1,t_2,\ldots,t_n\) are increasing
scale values, we discretize \(f\)
as:
\[\begin{equation} \Big(\frac{1}{\Delta t_1}\int_{t_1}^{t_2}f(t)dt,\frac{1}{\Delta t_2}\int_{t_2}^{t_3}f(t)dt,\ldots,\frac{1}{\Delta t_{n-1}}\int_{t_{n-1}}^{t_n}f(t)dt\Big)\in\mathbb{R}^{n-1}, \end{equation}\]
where \(\Delta t_k=t_{k+1}-t_k\), \(k=1,\ldots,n-1\).
Alternatively, by setting the evaluate
argument to
“points”, one can vectorize the Betti curve and all the other univariate
summary functions by evaluating them at each scale point and arranging
the values into a vector:
\[\begin{equation} (f(t_1),f(t_2),\ldots,f(t_n))\in\mathbb{R}^{n}, \end{equation}\]
As a note, the computePersistenceLandscape()
function
supports both classic and generalized persistence landscape functions.
For example:
# classic
computePersistenceLandscape(D,homDim=1,scaleSeq,k=3) # k = the number of persistence landscape functions to consider (default is 1)
#> [,1] [,2] [,3]
#> [1,] 0.000000000 0.00000000 0
#> [2,] 0.000000000 0.00000000 0
#> [3,] 0.014217630 0.00000000 0
#> [4,] 0.016566914 0.01077521 0
#> [5,] 0.000931983 0.00000000 0
#> [6,] 0.091114536 0.00000000 0
#> [7,] 0.191114536 0.00000000 0
#> [8,] 0.291114536 0.00000000 0
#> [9,] 0.391114536 0.00000000 0
#> [10,] 0.369212150 0.00000000 0
#> [11,] 0.269212150 0.00000000 0
#> [12,] 0.169212150 0.00000000 0
#> [13,] 0.069212150 0.00000000 0
#> [14,] 0.000000000 0.00000000 0
#> [15,] 0.000000000 0.00000000 0
#> [16,] 0.000000000 0.00000000 0
# generalized
computePersistenceLandscape(D,homDim=1,scaleSeq,k=3,generalized=TRUE,kernel = "epanechnikov",h=0.2) # h = bandwidth for the kernel function
#> [,1] [,2] [,3]
#> [1,] 0.03443388 1.830831e-02 0.011106582
#> [2,] 0.04308709 2.187928e-02 0.020820527
#> [3,] 0.04701315 2.925303e-02 0.021201592
#> [4,] 0.04621208 3.322784e-02 0.019451506
#> [5,] 0.04068387 3.380371e-02 0.016559979
#> [6,] 0.18291816 3.098063e-02 0.030428519
#> [7,] 0.30725636 2.475862e-02 0.015446032
#> [8,] 0.38857823 1.513766e-02 0.004707859
#> [9,] 0.42688376 2.693422e-03 0.002117754
#> [10,] 0.42217296 1.126474e-06 0.000000000
#> [11,] 0.37444582 0.000000e+00 0.000000000
#> [12,] 0.28370235 0.000000e+00 0.000000000
#> [13,] 0.14994254 0.000000e+00 0.000000000
#> [14,] 0.00000000 0.000000e+00 0.000000000
#> [15,] 0.00000000 0.000000e+00 0.000000000
#> [16,] 0.00000000 0.000000e+00 0.000000000
Computing algebraic and statistical vector summaries follows a similar approach. For example:
computeAlgebraicFunctions(D,homDim=0)
#> f1 f2 f3 f4
#> 0.000000000 2.554431476 0.000000000 0.001950765
returns four algebraic functions based on the birth and death values (refer to the help documentation for more details).
Persistence surface and persistence block are bivariate summary functions and to vectorize them, we first need to switch from the birth-death to the birth-persistence coordinates:
Below, we compute the persistence image, which is a vector summary of the persistence surface:
# Persistence Image (PI)
resB <- 5 # resolution (or grid size) along the birth axis
resP <- 5 # resolution (or grid size) along the persistence axis
# find min and max persistence values for dimension H_0
minPH0 <- min(D[D[,1]==0,3]); maxPH0 <- max(D[D[,1]==0,3])
# construct one-dimensional grid of scale values
ySeqH0 <- seq(minPH0,maxPH0,length.out=resP+1)
# default way of selecting the standard deviation sigma of the Gaussians on top of each point of the diagram
sigma <- 0.5*(maxPH0-minPH0)/resP
# compute PI for homological dimension H_0
computePersistenceImage(D,homDim=0,xSeq=NA,ySeqH0,sigma)
#> [1] 4.668020 10.986339 10.856586 7.802121 3.695877
Since the \(H_0\) features all have the birth value of zero in this case, a one-dimensional grid of scale values is used for vectorization.
For homological dimension \(H_1\), the birth values are not all the same and therefore the vectorization is performed over a two-dimensional grid.
# Persistence Image (PI)
# find min & max birth and persistence values for dimension H_1
minBH1 <- min(D[D[,1]==1,2]); maxBH1 <- max(D[D[,1]==1,2])
minPH1 <- min(D[D[,1]==1,3]); maxPH1 <- max(D[D[,1]==1,3])
xSeqH1 <- seq(minBH1,maxBH1,length.out=resB+1)
ySeqH1 <- seq(minPH1,maxPH1,length.out=resP+1)
sigma <- 0.5*(maxPH1-minPH1)/resP
# compute PI for homological dimension H_1
computePersistenceImage(D,homDim=1,xSeqH1,ySeqH1,sigma)
#> [1] 4.371601e-02 4.955567e-02 4.512378e-02 3.299957e-02 1.864442e-02
#> [6] 7.527654e-03 7.760847e-03 6.305986e-03 4.250972e-03 2.278547e-03
#> [11] 6.038006e-05 5.841014e-05 4.427399e-05 3.094726e-05 2.021064e-05
#> [16] 1.834675e-04 8.921009e-04 2.714215e-03 5.171823e-03 6.175378e-03
#> [21] 3.853840e-03 1.874023e-02 5.701774e-02 1.086451e-01 1.297270e-01
The code for computePersistenceImage()
is adopted from
the pers.image()
function (available in the
kernelTDA
package) with some modifications. For example,
pers.image()
uses a one-dimensional grid for homological
dimension \(H_0\) regardless of the
filtration type. In contrast, computePersistenceImage()
uses a one-dimensional grid only if additionally the birth values are
the same (which may not be true for some filtration types). Moreover,
pers.image()
uses a square grid (e.g., 10 by 10) for
vectorization, whereas computePersistenceImage()
is not
limited to such a grid and can compute vector summaries using a
rectangular grid (e.g., 10 by 20).
Chazal, F., & Michel, B. (2021). An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists. Frontiers in Artificial Intelligence, 108.
Chung, Y. M., & Lawson, A. (2022). Persistence curves: A canonical framework for summarizing persistence diagrams. Advances in Computational Mathematics, 48(1), 1-42.
Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16(1), 77-102.
Berry, E., Chen, Y. C., Cisewski-Kehe, J., & Fasy, B. T. (2020). Functional summaries of persistence diagrams. Journal of Applied and Computational Topology, 4(2):211–262.
Adcock, A., Carlsson, E. and Carlsson, G., 2013. The ring of algebraic functions on persistence bar codes. Homology, Homotopy Appl., 18:381–402, 2016.
Ali, D., Asaad, A., Jimenez, M.J., Nanda, V., Paluzo-Hidalgo, E. and Soriano-Trigueros, M., (2023). A survey of vectorization methods in topological data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., … & Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.