The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
library(randPedPCA)
#> Loading required package: spam
#> Spam version 2.11-1 (2025-01-20) is loaded.
#> Type 'help( Spam)' or 'demo( spam)' for a short introduction
#> and overview of this package.
#> Help for individual functions is also obtained by adding the
#> suffix '.spam' to the function name, e.g. 'help( chol.spam)'.
#>
#> Attaching package: 'spam'
#> The following objects are masked from 'package:base':
#>
#> backsolve, forwardsolve
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following object is masked from 'package:spam':
#>
#> det
#> Loading required package: pedigreeTools
Principal component analysis (PCA) is a method for summarising the information in the columns of a matrix. Because pedigrees can be represented as matrices, they can, in principle, be subjected to PCA.
When thinking about PCA, one usually begins with a data matrix, such as the iris dataset, where rows represent individuals (or observations) and columns contain traits (or features or markers). However, when dealing with a pedigree-derived matrix, there is no data matrix of this kind. For instance, a pedigree’s additive relationship matrix \(A\) is a covariance matrix. Fortunately, PCA can also be achieved by starting from a covariance matrix. That is what the randPedPCA package does.
In this vignette, we demonstrate how to use the package randPedPCA and how to carry out PCA on your own pedigree or pedigree matrix.
Using an example dataset that ships with
randPedPCA
, called pedMeta
,
we first generate a pedigree object. This is done using the function
pedigree
from the dependence
pedigreeTools
. Then we use
rppca
to perform PCA. Ignore the deprecation warning about
matrix conversion if one comes up. It is generated by a dependency.
# generate a pedigree object from the example dataset provided
ped <- pedigree(pedMeta$fid,
pedMeta$mid,
pedMeta$id)
pc01 <- rppca(ped, center=T)
#> 'as(<dtTMatrix>, "dtCMatrix")' is deprecated.
#> Use 'as(., "CsparseMatrix")' instead.
#> See help("Deprecated") and help("Matrix-deprecated").
plot(pc01, col = pedMeta$population)
#> Subsetting colours using the individual index.
Instead of starting from a pedigree, we can also use the pedigree’s
\(L^{-1}\) matrix as the input to
rppca
. This matrix can be generated using the
getLInv
function from
pedigreeTools
. Note that
rppca
expects a sparse matrix in spam
format.
This is why we wrap sparse2spam
around
getLInv
:
li <- sparse2spam(getLInv(ped))
pc02 <- rppca(li, center = T)
plot(pc02, col = pedMeta$population)
#> Subsetting colours using the individual index.
For simplicity, we decided to ship this \(L^{-1}\) matrix as an example dataset. The
PCA result is identical when using this object pedLInv
:
pc03 <- rppca(pedLInv, center = T)
plot(pc03, col = pedMeta$population)
#> Subsetting colours using the individual index.
It is usually recommended that PCA is run on centred data, which is
what we did in the examples above. If the data are not centred, PC1
tends to capture the deviation from mean = 0 of each data column. In the
context of pedigree PCA, this usually means that PC1 accounts for a high
proportion of the total variance and it usually aligns with time or
generation number.
The plots above show an un-centred and a cented PCA of the same simulated pedigree. Two breeds (red and black) are generated from an initial population and are propagated individually for 10 generations. Then, a hybrid population (green) is made, which keeps receiving geneflow from both red and black for another 10 generations. Without centring (left), PC1 aligns well with the generation number, a patter that we often observe. With centring (right), PC1 reflects the differentiation between the breeds.
When running PCA, users often care about how much variance and what
proportion of the total variance is accounted for by individual
principal components. This can be queried using the summary
function on a PCA object. For objects generated by rppca
,
summary
will return the standard deviation of each PC.
There will be two additional, sowing the proportion of total variance
and their cumulative sum, unless if the input to rppca
was
an \(L^{-1}\) matrix. R users may be
used to this behaviour from the built-in functions prcomp
and princomp
.
The reason that the variance proportions and cumulative sums are not
shown by default is that these numbers are costly to estimate from a
large \(L^{-1}\) matrix, compared to
running the actual PCA. If the total variance is known, it can be
specified when running rppca
using the argument
totVar
. Below, we describe how the total variance can be
estimated.
In the context of pedigrees, the total variance of the data corresponds to the variance of the pedigree’s \(L\) matrix, which is equivalent to the trace of the corresponding \(A\) matrix. Both matrices are expensive to compute. If the PCA is ran without centring, then the total variance also equals the sum of (the inbreeding values + 1), which can be computed directly from the pedigree. With centring, this does not work and the total variance will be lower. It has to be estimated. Here, we implemented the Hutch++ trace estimation algorithm for this purpose.
# True total variance computed via the pedigree's inbreeding values
sum(inbreeding(ped) + 1) # "ped" was defined at the top
#> [1] 3521.534
# Now estimate the total variance (the trace of A) from the corresponding
# L inverse matrix using Hutch++
li <- sparse2spam(getLInv(ped)) # generate L inverse and convert to spam format
set.seed(123345) # set random seed as Hutch++ uses random numbers
hutchpp(li) # Hutch++ with default settings
#> [1] 3441.592
#> attr(,"center")
#> [1] FALSE
# for higher accuracy increase num_queries (increases running time)
hutchpp(li,num_queries = 100)
#> [1] 3521.944
#> attr(,"center")
#> [1] FALSE
We recommend that users experiment with the value of
num_queries
to optimise accuracy. Using the the default
value of 10 should work within reasonable time even for a large pedigree
of 1M individuals.
Hutch++ can also be used to compute the total variance of a centred dataset. We demonstrate this here using a small simulated example, where we can perform naive centring for comparison.
# Get L, the "data matrix" of a pedigree
ll <- getL(ped)
# centre L (because L is upper triangular, we centre the rows)
llc <- apply(ll, 1, function(x) x - mean(x))
# compute additive relationship matrix of the centred data
ac <- llc %*% t(llc)
sum(diag(ac)) # exact value (would be too expensive to compute for a large pedigree)
#> [1] 2694.038
# Obtain centred estimate from L inverse using Hutch++
li <- sparse2spam(getLInv(ped)) # generate L inverse and convert to spam format
set.seed(123345) # set random seed as Hutch++ uses random numbers
hutchpp(li,num_queries = 100, center = T) # estimate
#> [1] 2673.6
#> attr(,"center")
#> [1] TRUE
If you want to use a hutchpp
estimate when
running rppca
, make sure that the
center
values are identical. E.g., when running a PCA
with centring, make sure that you have run hutchpp
with centring as well!
The summary of a PCA generated from an \(L^{-1}\) matrix by default only shows the standard deviation accounted for by each PC. It also throws a warning saying that there is no information about the total variance:
summary(pc02)
#> Warning in summary.rppca(pc02): Input does not contain information on variance components. Consider
#> running rppca on a pedigree object or supplying an estimate of the total
#> variance of the data.
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
#> Standard deviation 0.5395 0.2449 0.1562 0.1339 0.09454 0.09086 0.08536 0.07596
#> PC9 PC10
#> Standard deviation 0.07245 0.06749
When starting from a pedigree and with no centring, then variance proportions and cumulative sums are returned as well:
pc04 <- rppca(ped, center=F)
summary(pc04)
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 0.6022 0.5359 0.16342 0.15344 0.103227 0.096252 0.083479
#> Proportion of Variance 0.2728 0.2161 0.02009 0.01771 0.008016 0.006969 0.005242
#> Cumulative proportion 0.2728 0.4889 0.50899 0.52671 0.534720 0.541690 0.546930
#> PC8 PC9 PC10
#> Standard deviation 0.079495 0.074385 0.068442
#> Proportion of Variance 0.004754 0.004162 0.003524
#> Cumulative proportion 0.551690 0.555850 0.559370
If the user has an estimate of the total variance,
this can be supplied when calling rppca
on a matrix object
and the variance proportions and cumulative sums are returned. In
practice, this should rarely be needed as the user will probably have
access to the pedigree, from which the total variance is automatically
computed (example above).
pc05 <- rppca(pedLInv, center=F, totVar=3521.534)
summary(pc05)
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 0.6022 0.5359 0.16342 0.15344 0.10293 0.096631 0.087215
#> Proportion of Variance 0.2728 0.2161 0.02009 0.01771 0.00797 0.007024 0.005722
#> Cumulative proportion 0.2728 0.4889 0.50899 0.52670 0.53467 0.541700 0.547420
#> PC8 PC9 PC10
#> Standard deviation 0.078210 0.072653 0.07145
#> Proportion of Variance 0.004601 0.003971 0.00384
#> Cumulative proportion 0.552020 0.555990 0.55983
When running rppca
on a pedigree object without
centring, totVar can be used to override the value computed from the
pedigree. This triggers a warning. In this case here, it also causes
variance proportions > 1 which do not make sense:
pc07 <- rppca(ped, center=F, totVar=123)
#> Warning in rppca.pedigree(ped, center = F, totVar = 123): Using specified value of 123 for the total variance
#> instead of the value computed from the pedigree, which was 3521.53374702136
summary(pc07)
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 0.6022 0.5359 0.1634 0.1534 0.1021 0.09601 0.08621
#> Proportion of Variance 7.8114 6.1862 0.5749 0.5071 0.2244 0.19853 0.16008
#> Cumulative proportion 7.8114 13.9976 14.5725 15.0796 15.3040 15.50250 15.66258
#> PC8 PC9 PC10
#> Standard deviation 0.07924 0.06710 0.05855
#> Proportion of Variance 0.13524 0.09697 0.07384
#> Cumulative proportion 15.79782 15.89478 15.96862
Centred PCA
Variance proportions are automatically computed if the input is a
pedigree.
pc06 <- rppca(ped, center=T)
summary(pc06)
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6
#> Standard deviation 0.5395 0.24494 0.15625 0.13398 0.096318 0.091287
#> Proportion of Variance 0.2862 0.05899 0.02401 0.01765 0.009122 0.008194
#> Cumulative proportion 0.2862 0.34523 0.36924 0.38689 0.396010 0.404200
#> PC7 PC8 PC9 PC10
#> Standard deviation 0.084037 0.075591 0.072224 0.069821
#> Proportion of Variance 0.006944 0.005618 0.005129 0.004793
#> Cumulative proportion 0.411150 0.416770 0.421900 0.426690
If the input is an \(L^{-1}\) matrix, variance proportions are only computed if an estimate of the total variance is supplied.
# No proportions shown by default
pc08 <- rppca(pedLInv, center=T)
summary(pc08)
#> Warning in summary.rppca(pc08): Input does not contain information on variance components. Consider
#> running rppca on a pedigree object or supplying an estimate of the total
#> variance of the data.
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
#> Standard deviation 0.5395 0.2449 0.1562 0.134 0.09676 0.09105 0.08391 0.07764
#> PC9 PC10
#> Standard deviation 0.07239 0.06893
# Only when estimate is supplied
pc09 <- rppca(pedLInv, center=T, totVar=2673.6)
summary(pc09)
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 0.5395 0.24494 0.15625 0.1340 0.096303 0.089055 0.086346
#> Proportion of Variance 0.2884 0.05945 0.02419 0.0178 0.009189 0.007858 0.007387
#> Cumulative proportion 0.2884 0.34787 0.37206 0.3899 0.399050 0.406910 0.414290
#> PC8 PC9 PC10
#> Standard deviation 0.078623 0.071970 0.061636
#> Proportion of Variance 0.006125 0.005132 0.003764
#> Cumulative proportion 0.420420 0.425550 0.429320
We generated a plot
method for rppca
objects. If the variance proportions are known, they will be displayed
in the axis labels by default. Note that if the total variance was
estimated as in this example, where it was supplied when calling
rppca
, then the proportions displayed come with some
uncertainty.
# PC1 and PC2 are plotted by default
plot(pc06, col=pedMeta$population, main="My pedigree PCA")
#> Subsetting colours using the individual index.
# plot PC1 and PC3 instead
plot(pc06, dims=c(1,3), col=pedMeta$population, main="My pedigree PCA,\ncustom PCs shown")
#> Subsetting colours using the individual index.
Because plot.rppca
relies on R’s relatively slow base
plotting function, it is useful to down-sample the individuals to be
plotted. Thus, plot.rppca
will automatically down-sample to
10,000 individuals if there are more in the dataset. Technically there
is no down-sampling and all the data are retained. Instead, an integer
vector is added in the ds
slot of the rppca
object. That vector is automatically used by plot.rppca
for
individuals and any colours specified by the col
parameter.
The value of col
should thus have the same length as there
were individuals in the full dataset. To adjust the down-sampling, one
can use the parameter to
of plot.rppca
. The
value of to
is passed to the function dspc
,
which carries out the down-sampling/vector generation. See
?dspc
for details. To plot wihthout any down-sampling, do
plot(myPc, to=NA)
.
One potential issue is that every time plot.rppca
is
run, a new set of individuals is sampled, so that repeated plots of the
same dataset will look different. To retain a specific down-sampling,
one can run dspc
outside of plot.rppca
, like
pcDown <- dspc(pc)
. Calling plot(pcDown)
repeatedly will generate the same plot every time.
Users will mainly want to analyse their own empirical or simulated
datasets. This should be straightforward if the data are in pedigree
format, which we recommend. In this case, rppca
can be
called directly in such an object as demonstrated above.
If the user has an \(L^{-1}\)
matrix, this needs to be converted into a spam
object, a
specific sparse matrix format. We provide a wrapper called
importLinv
for reading matrices in Matrix Market format
from disk and converting these into spam
format. If a
sparse \(L^{-1}\) matrix is already
available in in the workspace, sparse2spam
should work to
convert it to spam
format. Let us know if this does not
work for you (and please share a reproducible example).
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.