Unfortunately, creating package vignettes in LaTeX with R Markdown is difficult because of the fragility of the interface. After spending several days on this I finally gave in and rewrote them in Markdown. This was painful, as I had to spend a lot of time rewriting the documentation instead of debugging the package. This means that the quality of these vignettes is not what I would have liked. On the positive side, I can now send cool emails in Markdown :-)
spca is an R package for running Sparse Principal Component Analysis. It implements the LS SPCA approach, which computes the Least Squares estimates of sparse PCs (Merola (2014)). Unlike other SPCA methods, these solutions maximise the variance of the data explained by the components.
The implementation is entirely in R, so it can only be run on small datasets (the limit depends on the hardware used; we were able to solve problems with about 1000 variables in minutes). The package is self-contained, as it only depends on the library MASS, which is part of the base distribution of R.
Details about LS SPCA and the methodology implemented in the package can be found in Merola (2014, arXiv) and in the forthcoming peer-reviewed paper.
I had difficulties publishing the LS SPCA paper, possibly because LS SPCA improves on existing methods. This is confirmed by the fact that Technometrics' chief editor, Dr Qiu, rejected the paper endorsing a report stating that the LS criterion is "a new measure used ad hoc" :-D This on top of a number of blatantly wrong arguments. I am now waiting for the response of a reviewer who asked me to compare the roughly 20 existing SPCA methods with mine on more datasets (only because I show that my solutions maximise the variance explained and theirs don't)!
Principal Component Analysis was developed by Pearson (1901) to attain the components that minimise the LS criterion when approximating the data matrix. If \(X\) is a matrix with \(n\) rows of observations on \(p\) variables, the PCs are defined by the loadings \(a_j\) as \(t_j = X a_j\). The matrix of \(d<p\) PCs, \(T = XA\), is derived as the set of regressors that minimises the Residual Sum of Squares. By the principle of the Extra Sum of Squares, the components can be constrained to be uncorrelated without loss of optimality. Therefore the PCA loadings are obtained by solving: \[ A = \text{arg}\min ||X - TP'||^2 = \text{arg}\max\frac{A'SSA}{A'SA} = \text{arg}\max \sum_{1}^d \frac{a_j'SSa_j}{a_j'Sa_j}\\ \text{subject to}\ a_j'Sa_k = 0,\ j\neq k, \] where \(P\) is the matrix of regression coefficients of \(X\) on \(T\) and \(S\) is the covariance matrix of the \(x\) variables. The terms in the last summation are the variances explained by each component. The solutions are proportional to the eigenvectors of \(S\) corresponding to the eigenvalues taken in nonincreasing order. It is well known that the resulting components are mutually uncorrelated.
Hotelling (1933) gives the PCs' loadings as the eigenvectors of \(S\) with unit Euclidean norm. Using this normalisation, the maximisation of the variance explained by each component simplifies to \[ A = \text{arg}\max \sum_{1}^d {a_j'Sa_j}\\ \text{subject to}\ a_j'a_k = \delta_{jk}, \] where \(\delta_{jk} = 1\) if \(j=k\) and \(\delta_{jk} = 0\) otherwise.
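As a quick base-R illustration of this derivation (the dataset and variables below are arbitrary choices), the loadings are the unit-norm eigenvectors of the covariance matrix and the variance of each component equals the corresponding eigenvalue:

```r
# Hotelling's derivation in base R: the unit-norm eigenvectors of S are the
# loadings and the PC variances are the eigenvalues, in nonincreasing order.
X <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]), scale = FALSE)
S <- cov(X)
eS <- eigen(S, symmetric = TRUE)
A <- eS$vectors          # loadings: columns with unit Euclidean norm
colSums(A^2)             # all equal to 1
PCs <- X %*% A           # the principal components
zapsmall(cov(PCs))       # diagonal = eigenvalues, off-diagonal = 0 (uncorrelated PCs)
eS$values
```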
Because of its simplicity, Hotelling's derivation has been adopted for popularising PCA among practitioners. This choice was unfortunate, because the original objective of minimising the LS criterion has been somewhat forgotten. However, other than in Pearson's original paper, the LS derivation is given in several books and papers cited in my paper (e.g. ten Berge (1993) and Izenman (2008)).
When cardinality constraints (also called \(L_0\)-norm constraints) are imposed on the original PCA problem, the loadings are no longer eigenvectors of \(S\). Therefore, maximising Hotelling's simplified criterion is no longer equivalent to maximising the variance explained. Furthermore, by the Cauchy-Schwarz inequality: \[ \frac{a_j'SSa_j}{a_j'Sa_j} \geq \frac{a_j'Sa_j}{a_j'a_j} \] for any symmetric positive semidefinite matrix \(S\) (such as a covariance matrix), with equality if and only if the vectors \(a_j\) are eigenvectors of \(S\). Therefore, the components with maximal variance are suboptimal for explaining the variance.
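A quick numeric check of the inequality (the covariance matrix and the sparse loading below are arbitrary choices; any loading that is not an eigenvector gives a strict inequality):

```r
# a'SSa / a'Sa (variance explained) versus a'Sa / a'a (Hotelling's criterion)
S <- cov(mtcars[, 1:6])
a <- c(0.8, 0, -0.6, 0, 0, 0)   # a sparse unit-norm loading, not an eigenvector of S
ve  <- drop(t(a) %*% S %*% S %*% a) / drop(t(a) %*% S %*% a)
hot <- drop(t(a) %*% S %*% a) / drop(t(a) %*% a)
c(variance_explained = ve, hotelling = hot)   # variance_explained >= hotelling
```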
Other SPCA methods apply cardinality constraints to Hotelling's definition, hence they do not optimise the variance explained. In LS SPCA, instead, we derive the loadings from Pearson's LS optimisation by adding cardinality constraints.
The uncorrelated LS SPCA solutions are constrained Reduced Rank Regression solutions (see Izenman (1975) for the unconstrained solutions). The uncorrelatedness constraints limit the amount of variance explained by the solutions and require that the loadings have cardinality not smaller than their rank. Even though uncorrelated components are easier to interpret, in some cases correlated ones can be useful. Therefore, we also provide correlated sparse loadings that approximately minimise the LS criterion.
Finding the optimal indices for an spca solution is an intractable NP-hard problem.
Therefore, we find the solutions through two greedy algorithms: Branch-and-Bound (BB) and Backward Elimination (BE).
BB searches for the solutions that sequentially maximise the variance explained under the constraints. The solutions may not be a global maximum when more than one component is computed. The BB algorithm is a modification of Farcomeni's (2009) algorithm (thanks!).
BE aims to attain larger contributions while approximately minimising the LS criterion. It sequentially eliminates the smallest contributions (in absolute value) from a non-sparse solution.
The BE solutions will generally explain less variance than the BB ones. However, the BE algorithm is much faster and its solutions usually have larger loadings. The algorithm is illustrated in more detail in the BE Algorithm vignette, `vignette("BE algorithm", package = "spca")`.
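To make the idea concrete, here is a schematic sketch of backward elimination for a single component under the variance-explained criterion given above. It is only an illustration of the principle, not the package's implementation (which handles several components, uncorrelatedness constraints, thresholds and trimming); the function `be_sketch` and its arguments are made up for this example.

```r
# Backward elimination sketch for one sparse component:
# start from the loading that maximises a'SSa / a'Sa on the current index set
# (a generalized eigenproblem), then repeatedly drop the variable with the
# smallest absolute contribution and re-solve on the remaining indices.
be_sketch <- function(S, card) {
  J <- seq_len(ncol(S))
  SS <- S %*% S
  repeat {
    G <- eigen(solve(S[J, J, drop = FALSE], SS[J, J, drop = FALSE]))
    a <- Re(G$vectors[, which.max(Re(G$values))])   # loading on the current index set
    if (length(J) <= card) break
    J <- J[-which.min(abs(a))]                      # eliminate the smallest contribution
  }
  loading <- numeric(ncol(S))
  loading[J] <- a / sqrt(sum(a^2))                  # rescale to unit sum of squares
  loading
}

round(be_sketch(cov(mtcars), card = 3), 3)          # e.g. a cardinality-3 first component
```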
SPCA aims to obtain interpretable solutions. Interpretability, however, is not univocally defined: in the Factor Analysis literature there is plenty of discussion about what makes a solution interpretable and simple (both as qualities and as mathematical functions). Simplicity itself can be defined by different measures, being linked to sparseness, parsimony, variance explained and the size of the loadings. Therefore, for a given problem there usually exist several competing simple and interpretable solutions.
spca is implemented as an exploratory data analysis tool. The cardinality of the components can be chosen interactively, after inspecting traces and plots of solutions of different cardinality. Solutions can also be computed non-interactively.
spca contains methods for plotting and printing the solutions and for comparing different ones. In this way a solution can be chosen with respect to several different characteristics, which cannot all be included in a single objective function at the same time.
spca can also be helpful in a confirmatory stage of the analysis; in fact, the components can be constrained to be made up of only a subset of the variables.
The workhorse of the package is the function spca, which computes the optimal solutions for a given set of indices. It is called simply with a list of indices and the flags for correlated or uncorrelated components (one for each component, if necessary):
```r
spca(S, ind, unc = TRUE)
```
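For instance, using the baseball data from the example below (the index sets and the component structure here are chosen purely for illustration):

```r
library(spca)
data(bsbl)
# two components: the first built from variables 1 and 13, the second from
# variables 2, 8 and 14 (illustrative index sets, both kept uncorrelated)
sfix <- spca(bsbl, ind = list(c(1, 13), c(2, 8, 14)), unc = TRUE)
summary(sfix)
```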
The functions spcabb and spcabe implement the BB and BE searches, respectively:
```r
spcabb(S, card, unc = TRUE, startind, excludeload = FALSE, nvexp = FALSE, msg = TRUE)

spcabe(S, nd = FALSE, ndbyvexp = FALSE, mincard = NULL, thresh = FALSE, threshvar = FALSE,
       threshvaronPC = FALSE, perc = TRUE, unc = TRUE, trim = 1, reducetrim = TRUE,
       startind = NULL, excludeload = FALSE, diag = FALSE, choosecard = NULL, eps = 1e-04,
       msg = TRUE)
```
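For example (the cardinalities and the threshold below are arbitrary illustrative choices, assuming `card` takes one cardinality per component):

```r
library(spca)
data(bsbl)
# BB search for three components with cardinalities 2, 3 and 3
bbb <- spcabb(bsbl, card = c(2, 3, 3))
# BE search for three components, keeping only contributions of at least 25%
bbe <- spcabe(bsbl, nd = 3, thresh = 0.25)
summary(bbe)
```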
With help(spcabb) and help(spcabe) you will find examples of using spca and the utilities. The package vignettes (listed by `vignette(package = "spca")`) give a more complete example and details on the methods; these are also available in the manual, and a more complete example is given in the Advanced Example vignette.
There is also the function `pca`, which computes the PCA solutions and returns an spca object. The function is called as:
```r
pca(S, nd, only.values = FALSE, screeplot = FALSE, kaiser.print = FALSE)
```
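Since the returned object is of class spca, the printing and plotting methods described below should apply to it as well; a minimal sketch (the number of components is an arbitrary choice):

```r
library(spca)
data(bsbl)
bpca4 <- pca(bsbl, nd = 4)   # full PCA, first four components, returned as an spca object
summary(bpca4)
```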
The package contains methods for plotting, printing and comparing spca solutions. These are:
- `choosecard`: interactive method for choosing the cardinality. It plots and prints statistics for comparing solutions of different cardinality.
- `print`: shows a formatted matrix of the sparse loadings or contributions of a solution. Contributions are loadings expressed as percentages, while the loadings are scaled to unit sum of squares.
- `showload`: prints only the non-zero sparse loadings. This is useful when the number of variables is large.
- `summary`: shows formatted summary statistics of a solution.
- `plot`: plots the cumulative variance explained by the sparse solutions against that explained by the PCs, which is their upper bound. It can also plot the contributions in different ways.
- `compare`: plots and prints comparisons of two or more spca objects.
The naming of the arguments in R is not simple, mainly because different syntaxes have been used over the years. I tried to give meaningful names with different starting letters, so that R's useful feature of partial argument matching can be exploited. In the following example I sometimes use partial argument names.
```r
library(spca)
cat(paste("loaded spca version:", packageVersion("spca")))
#> loaded spca version: 0.6.0

data(bsbl)

#- ordinary PCA
bpca = pca(bsbl, screeplot = TRUE, kaiser.print = TRUE)
#> [1] "number of eigenvalues larger than 1 is 3"

#- sparse PCA with minimal contribution 25%
bbe1 <- spcabe(bsbl, nd = 4, thresh = 0.25, unc = FALSE)

#- summary output
summary(bbe1)
#>            Comp1 Comp2 Comp3 Comp4
#> PVE        44.4% 24.9% 10.3%  5.6%
#> PCVE       44.4% 69.3% 79.6% 85.2%
#> PRCVE      96.4% 96.1% 96.6% 97.1%
#> Card           2     3     3     1
#> Ccard          2     5     8     9
#> PVE/Card   22.2%  8.3%  3.4%  5.6%
#> PCVE/Ccard 22.2% 13.9%   10%  9.5%
#> Converged      0     0     0     0
#> MinCont    31.5% 26.3% 28.2%  100%

#-# Explaining over 96% of the PCs' variance with 2, 3, 3 and 1 variables.

#- print percentage contributions
bbe1
#> Percentage Contributions
#>         Comp1  Comp2  Comp3  Comp4
#> TAB_86   31.5   35.2
#> HR_86                  28.2
#> RUN_86          26.3
#> RUN            -38.5
#> RUNB     68.5
#> PO_86                           100
#> ASS_86                -40.9
#> ERR_86                -30.9
#>         -----  -----  -----  -----
#> PCVE     44.4   69.3   79.6   85.2
#>

#-# Simple combinations of offensive play in career and season are most important.
#-# Defensive play in season appears only in the 3rd component.

#- The contributions can be printed one by one using the descriptive names in `bsbl_labels`
data(bsbl_labels, package = "spca")
head(bsbl_labels)
#>   short.name                  label
#> 1     TAB_86   times at bat in 1986
#> 2     HIT_86           hits in 1986
#> 3      HR_86      home runs in 1986
#> 4     RUN_86           runs in 1986
#> 5      RB_86 runs batted-in in 1986
#> 6     WAL_86          walks in 1986

showload(bbe1, variablesnames = bsbl_labels[,2])
#> [1] "Component 1"
#> times at bat in 1986  runs batted-in during his career
#>                31.5%                             68.5%
#>
#> [1] "Component 2"
#> times at bat in 1986  runs in 1986  runs during his career
#>                35.2%         26.3%                  -38.5%
#>
#> [1] "Component 3"
#> home runs in 1986  assists in 1986  errors in 1986
#>             28.2%           -40.9%          -30.9%
#>
#> [1] "Component 4"
#> put outs in 1986
#>             100%
#>

#- plot solution
plot(bbe1, plotloadvsPC = TRUE, pc = bpca, mfr = 2, mfc = 2,
     variablesnames = as.character(bsbl_labels[,2]))

#-# Explaining the variance pretty closely to PCA with much fewer variables.
```
The package is developed in the GitHub repository merolagio/spca (https://github.com/merolagio/spca). To install it:
install.packages("spca")
if (packageVersion("spca") < 0.4.0) {
install.packages("devtools")
}
devtools::install_github("merolagio/spca")
This is the first release and will surely contain some bugs, even though I tried to test it. Please do let me know if you find any, or if you can suggest improvements. Please use the GitHub tools for submitting bug reports or contributions.
For now, most of the plots are produced with the base plotting functions. In a later release I will produce the plots with ggplot2 (which requires learning that package better).
The code is implemented in R, so it will not work for large datasets. I have in mind to develop C routines, at least for the matrix algebra. Anybody willing to help, please let me know.
Farcomeni, Alessio. 2009. “An Exact Approach to Sparse Principal Component Analysis.” Computational Statistics 24 (4): 583–604.
Hotelling, H. 1933. “Analysis of a Complex of Statistical Variables with Principal Components.” Journal of Educational Psychology 24: 498–520.
Izenman, A. J. 1975. “Reduced-Rank Regression for the Multivariate Linear Model.” Journal of Multivariate Analysis 5 (2): 248–64.
———. 2008. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer Texts in Statistics. New York: Springer.
Merola, G. 2014. “Least Squares Sparse Principal Component Analysis: A Backward Elimination Approach to Attain Large Loadings.” To Appear in Australian & New Zealand J. Stats. – Preprint available at http://arxiv.org/abs/1406.1381.
Pearson, K. 1901. “On lines and planes of closest fit to systems of points in space.” Philosophical Magazine 2 (6): 559–72.
ten Berge, J. M. F. 1993. Least Squares Optimization in Multivariate Analysis. Leiden: DSWO Press, Leiden University.