Foreword

Unfortunately, creating package vignettes in LaTeX using R Markdown is difficult because of the fragility of the interface. After spending several days on this I finally gave in and rewrote them in markdown. This was painful, as I had to spend a lot of time rewriting the documentation instead of debugging the package. This means that the quality of these vignettes is not what I would have liked. On the positive side, I can now send cool emails with markdown :-)

Intro

spca is an R package for running Sparse Principal Component Analysis. It implements the LS SPCA approach, which computes the Least Squares estimates of sparse PCs (Merola (2014)). Unlike other SPCA methods, these solutions maximise the variance of the data explained by the components.

The implementation is entirely in R, so it can only be run on small datasets (the limit depends on the hardware used; we were able to solve problems with about 1000 variables in minutes). The package is self-contained, as it only depends on the library MASS, which is part of the basic distribution of R.

Details about LS SPCA and the methodology implemented in the package can be found in Merola (2014, arXiv) and in the forthcoming peer-reviewed paper.

I had difficulties publishing the LS SPCA paper, possibly because LS SPCA improves on existing methods. This is confirmed by the fact that Technometrics' chief editor, Dr Qiu, rejected the paper endorsing a report stating that the LS criterion is a new measure used ad hoc :-D This was on top of a number of blatantly wrong arguments. I am now waiting for the response of a reviewer who asked me to compare the roughly 20 existing SPCA methods with mine on more datasets (only because I show that my solutions maximise the variance explained and theirs don't)!

A little math

Principal Component Analysis was developed by Pearson (1901) to attain the components that minimise the LS criterion when approximating the data matrix. If \(X\) is a matrix with \(n\) rows of observations on \(p\) variables, the PCs are defined by the loadings \(a_j\) as \(t_j = X a_j\). The matrix of \(d<p\) PCs, \(T = XA\), is derived as the set of regressors that minimises the Residual Sum of Squares, where \(P\) denotes the matrix of regression coefficients of \(X\) on \(T\). By the principle of the Extra Sum of Squares, the components can be constrained to be uncorrelated without loss of optimality. Therefore the PCA solution is obtained by solving: \[ A = \text{arg}\min ||X - TP'||^2 = \text{arg}\max\frac{A'SSA}{A'SA} = \text{arg}\max \sum_{1}^d \frac{a_j'SSa_j}{a_j'Sa_j}\\ \text{subject to}\ a_j'Sa_k = 0,\ j\neq k, \] where \(S\) is the covariance matrix of the \(x\) variables. The terms in the last summation are the variances explained by each component. The solutions are proportional to the eigenvectors of \(S\) corresponding to the eigenvalues taken in nonincreasing order, and it is well known that the resulting components are mutually uncorrelated.

Hotelling (1933) gives the PCs' loadings as the eigenvectors of \(S\) with unit Euclidean norm. Using this normalisation, the maximisation of the variance explained by each component simplifies to \[ A = \text{arg}\max \sum_{1}^d {a_j'Sa_j}\\ \text{subject to}\ a_j'a_k = \delta_{jk}, \] where \(\delta_{jk} = 1\) if \(j=k\) and \(\delta_{jk} = 0\) otherwise.
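As a small base-R illustration of these definitions (this sketch does not use the spca package, and the USArrests data are chosen only for convenience), the full PCA loadings and the variance explained by each component can be read off the eigendecomposition of the covariance matrix:

#- base-R sketch: full PCA from the eigendecomposition of S
S <- cov(USArrests)            #- any covariance matrix will do
e <- eigen(S)
A <- e$vectors                 #- loadings, columns with unit Euclidean norm
vexp <- e$values               #- variance explained, in nonincreasing order
round(vexp / sum(diag(S)), 3)  #- proportion of total variance explained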

Because of its simplicity, Hotelling's derivation has been adopted for popularising PCA among practitioners. This choice was unfortunate because the original objective of minimising the LS criterion has been somewhat forgotten. However, besides Pearson's original paper, the LS derivation is given in several books and papers cited in my paper (e.g. tenBerge (1993) and Izenman (2008)).

When cardinality constraints (also called \(L_0\)-norm constraints) are imposed on the original PCA problem, the loadings are no longer eigenvectors of \(S\). Therefore, Hotelling's simplified objective is no longer equivalent to the variance explained. Furthermore, by the Cauchy-Schwarz inequality: \[ \frac{a_j'SSa_j}{a_j'Sa_j} \geq \frac{a_j'Sa_j}{a_j'a_j} \] for any symmetric positive semidefinite matrix \(S\), with equality if and only if the vectors \(a_j\) are eigenvectors of \(S\). Therefore, the components with maximal variance are suboptimal for explaining the variance.
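The inequality is easy to verify numerically; in the following base-R sketch (again using USArrests only as a convenient example) the bound is checked for an arbitrary sparse vector, which is not an eigenvector of \(S\):

#- base-R sketch: numerical check of the Cauchy-Schwarz bound
S <- cov(USArrests)
a <- c(1, 1, 0, 0)       #- an arbitrary sparse vector, not an eigenvector of S
lhs <- drop(t(a) %*% S %*% S %*% a) / drop(t(a) %*% S %*% a)  #- variance explained criterion
rhs <- drop(t(a) %*% S %*% a) / sum(a^2)                      #- variance of the component
lhs >= rhs               #- TRUE; equality holds only for eigenvectors of S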

Other SPCA methods apply cardinality constraints to Hotelling's definition, hence they do not optimise the variance explained. Instead, in LS SPCA we derive the loadings from Pearson's LS optimisation with added cardinality constraints.

The uncorrelated LS SPCA solutions are constrained Reduced Rank Regression solutions (see Izenman (1975) for the unconstrained solutions). The uncorrelatedness constraints limit the amount of variance explained by the solutions and require that the loadings have cardinality not smaller than their rank. Even though uncorrelated components are easier to interpret, in some cases correlated ones can be useful. Therefore, we also provide correlated sparse loadings that approximately minimise the LS criterion.

Optimisation Models

Finding the optimal indices for an spca solution is an intractable NP-hard problem.

Therefore, we find the solutions through two search algorithms: Branch-and-Bound (BB) and Backward Elimination (BE).

Use of the package

SPCA aims to obtain interpretable solutions

Interpretability is not univocally defined; in the Factor Analysis literature there is plenty of discussion about how to define interpretable and simple solutions (both as qualities and as mathematical functions). Hence, for a given problem there usually exist several competing simple and interpretable solutions.

spca is implemented as an exploratory data analysis tool

The cardinality of the components can be chosen interactively, after inspecting traces and plots of solutions of different cardinalities.

Solutions can also be computed non-interactively.

spca contains methods for plotting and printing the solutions and for comparing different ones. In this way the solution can be chosen with respect to several different characteristics, which could not all be included in a single objective function.

spca can also be helpful in a confirmatory stage of the analysis; for example, the components can be constrained to be made up of only a subset of the variables.

Functions

The workhorse of the package is the function spca, which computes the optimal solutions for a given set of indices.

It is called simply with a list of indices and the flags for correlated or uncorrelated components (one for each component, if necessary):

spca(S, ind, unc = TRUE)
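For example, a hypothetical call (the index sets below are chosen purely for illustration) computing two uncorrelated components from a covariance matrix S, the first built from variables 1 to 3 and the second from variables 4 to 6, would be:

#- hypothetical call: two uncorrelated components from given index sets
myind <- list(1:3, 4:6)
sol <- spca(S, ind = myind, unc = TRUE)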

The functions spcabb and spcabe implement the BB and BE searches, respectively.

spcabb(S, card, unc = TRUE, startind, excludeload = FALSE, nvexp = FALSE, msg = TRUE)
spcabe(S, nd = FALSE, ndbyvexp = FALSE, mincard = NULL, thresh = FALSE, threshvar = FALSE, threshvaronPC = FALSE, perc = TRUE, 
    unc = TRUE, trim = 1, reducetrim = TRUE, startind = NULL, excludeload = FALSE, diag = FALSE, choosecard = NULL, eps = 1e-04, 
    msg = TRUE)
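As an illustration only (the argument values are chosen arbitrarily, and I assume here that card gives the cardinality requested for each component), calls to the two searches on a covariance matrix S might look like:

#- hypothetical calls to the two searches
bb <- spcabb(S, card = c(2, 3, 3))      #- BB search: three components of cardinality 2, 3 and 3
be <- spcabe(S, nd = 4, thresh = 0.25)  #- BE search: four components with contributions of at least 25%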

With help(spcabb) and help(spcabe) you will find examples of using spca and the utilities. The package vignettes give more details on the methods; these are also available in the manual, and a more complete example is shown in the Advanced Example vignette.

There is also the function ‘pca’ which computes the PCA solutions and returns an spca object. The function is called as:

pca(S, nd, only.values = FALSE, screeplot = FALSE, kaiser.print = FALSE)

Methods

The package contains methods for plotting, printing and comparing spca solutions; summary, print, plot and showload are illustrated in the minimal example below.

Minimal Example

The naming of the arguments in R is not simple, mainly because different syntaxes have been used over the years. I tried to give meaningful names starting differently, so that R's useful feature of partial matching of argument names can be exploited. In the following example I sometimes use partial argument names.
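For instance, mincard can be abbreviated to any unambiguous prefix such as minc (a hypothetical call on the bsbl data loaded below, with argument values chosen only for this illustration):

#- hypothetical call using a partial argument name ("minc" matches mincard)
bbe0 <- spcabe(bsbl, nd = 4, minc = 2, unc = FALSE)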

library(spca)
cat(paste("loaded spca version:", packageVersion("spca")))
#> loaded spca version: 0.6.0
data(bsbl)

#- ordinary PCA
bpca = pca(bsbl, screeplot = TRUE, kaiser.print = TRUE)

#- (screeplot produced by the call above shown here)

#> [1] "number of eigenvalues larger than 1 is 3"
#- sparse PCA with minimal contribution 25%
bbe1 <- spcabe(bsbl, nd = 4, thresh = 0.25, unc = FALSE)

#- summary output
summary(bbe1)
#>            Comp1 Comp2 Comp3 Comp4
#> PVE        44.4% 24.9% 10.3% 5.6% 
#> PCVE       44.4% 69.3% 79.6% 85.2%
#> PRCVE      96.4% 96.1% 96.6% 97.1%
#> Card       2     3     3     1    
#> Ccard      2     5     8     9    
#> PVE/Card   22.2% 8.3%  3.4%  5.6% 
#> PCVE/Ccard 22.2% 13.9% 10%   9.5% 
#> Converged  0     0     0     0    
#> MinCont    31.5% 26.3% 28.2% 100%
#-# Explaining over 96% of the PCs' variance with 2, 3, 3 and 1 variables.

#- print percentage contributions
bbe1
#> Percentage Contributions
#>        Comp1 Comp2 Comp3 Comp4
#> TAB_86  31.5  35.2            
#> HR_86               28.2      
#> RUN_86        26.3            
#> RUN          -38.5            
#> RUNB    68.5                  
#> PO_86                      100
#> ASS_86             -40.9      
#> ERR_86             -30.9      
#>        ----- ----- ----- -----
#> PCVE   44.4  69.3  79.6  85.2 
#> 
#-# Simple combinations of offensive play in career and season are most important. Defensive play in season appears only in 3rd component.

#- The contributions can be printed one by one using the descriptive names in `bsbl_labels`
data(bsbl_labels, package = "spca")
head(bsbl_labels)
#>   short.name                  label
#> 1     TAB_86   times at bat in 1986
#> 2    HIT_86            hits in 1986
#> 3      HR_86      home runs in 1986
#> 4     RUN_86           runs in 1986
#> 5      RB_86 runs batted-in in 1986
#> 6     WAL_86          walks in 1986
showload(bbe1, variablesnames = bsbl_labels[,2])
#> [1] "Component 1"
#>             times at bat in 1986 runs batted-in during his career 
#>                            31.5%                            68.5% 
#>  
#> [1] "Component 2"
#>   times at bat in 1986           runs in 1986 runs during his career 
#>                  35.2%                  26.3%                 -38.5% 
#>  
#> [1] "Component 3"
#> home runs in 1986   assists in 1986    errors in 1986 
#>             28.2%            -40.9%            -30.9% 
#>  
#> [1] "Component 4"
#> put outs in 1986 
#>             100% 
#> 
#- plot solution
plot(bbe1, plotloadvsPC = TRUE, pc = bpca, mfr = 2, mfc = 2, 
               variablesnames = as.character(bsbl_labels[,2]))

#-# Explaining the variance pretty closely to PCA with much fewer variables.


Installing the package

The package is developed in the GitHub repository at https://github.com/merolagio/spca.

install.packages("spca")
if (packageVersion("spca") < 0.4.0) {
  install.packages("devtools")
}
devtools::install_github("merolagio/spca")

Future releases

This is the first release and it will surely contain some bugs, even though I tried to test it. Please do let me know if you find any or can suggest improvements. Please use the GitHub issue tracker to submit bug reports or contributions.

For now most of the plots are produced with the basic plotting functions. In a later release I will produce the plots with ggplot2 (requires learning the package better).

The code is implemented in R, so it will not work for large datasets. I plan to develop C routines, at least for the matrix algebra. Anybody willing to help, please let me know.

References

Farcomeni, Alessio. 2009. “An Exact Approach to Sparse Principal Component Analysis.” Computational Statistics 24 (4): 583–604.

Hotelling, H. 1933. “Analysis of a Complex of Statistical Variables with Principal Components.” Journal of Educational Psychology 24: 498–520.

Izenman, A. J. 1975. “Reduced-Rank Regression for the Multivariate Linear Model.” Journal of Multivariate Analysis 5 (2): 248–64.

———. 2008. Modern Multivariate Statistical Techniques : Regression, Classification, and Manifold Learning. Springer Texts in Statistics. Springer New York.

Merola, G. 2014. “Least Squares Sparse Principal Component Analysis: A Backward Elimination Approach to Attain Large Loadings.” To appear in the Australian & New Zealand Journal of Statistics. Preprint available at http://arxiv.org/abs/1406.1381.

Pearson, K. 1901. “On lines and planes of closest fit to systems of points in space.” Philosophical Magazine 2 (6): 559–72.

tenBerge, J. M. F. 1993. Least Squares Optimization in Multivariate Analysis. DSWO Press, Leiden University.