Type: | Package |
Title: | Fast Implementation of (Local) Population Stratification Methods |
Version: | 1.9 |
Date: | 2022-04-07 |
Author: | Georg Hahn [aut,cre], Sharon M. Lutz [ctb], Christoph Lange [ctb] |
Maintainer: | Georg Hahn <ghahn@hsph.harvard.edu> |
Description: | Fast implementations to compute the genetic covariance matrix, the Jaccard similarity matrix, the s-matrix (the weighted Jaccard similarity matrix), and the (classic or robust) genomic relationship matrix of a (dense or sparse) input matrix (see Hahn, Lutz, Hecker, Prokopenko, Cho, Silverman, Weiss, and Lange (2020) <doi:10.1002/gepi.22356>). Full support for sparse matrices from the R-package 'Matrix'. Additionally, an implementation of the power method (von Mises iteration) to compute the largest eigenvector of a matrix is included, a function to perform an automated full run of global and local correlations in population stratification data, a function to compute sliding windows, and a function to invert minor alleles and to select those variants/loci exceeding a minimal cutoff value. New functionality in locStra allows one to extract the k leading eigenvectors of the genetic covariance matrix, Jaccard similarity matrix, s-matrix, and genomic relationship matrix via fast PCA without actually computing the similarity matrices. The fast PCA to compute the k leading eigenvectors can now also be run directly from 'bed'+'bim'+'fam' files. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Imports: | Rcpp (≥ 0.12.13), Rdpack, Matrix, RSpectra, bigsnpr |
RdMacros: | Rdpack |
LinkingTo: | Rcpp, RcppEigen |
RoxygenNote: | 7.1.1 |
NeedsCompilation: | yes |
Packaged: | 2022-04-08 00:26:41 UTC; acer |
Repository: | CRAN |
Date/Publication: | 2022-04-12 18:52:29 UTC |
Computation of the k leading eigenvectors of the covariance matrix directly from a bed+bim+fam file.
Description
Computation of the k leading eigenvectors of the covariance matrix directly from a bed+bim+fam file.
Usage
bed_fastCovEVs(f, k, q = 2)
Arguments
f |
The filename of the bed file (including its extension). The bim and fam files need to be in the same folder and have the same base filename. |
k |
The number of leading eigenvectors. |
q |
The number of power iteration steps (default is |
Value
The k leading eigenvectors of the covariance matrix of m
as a column matrix.
References
R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Stat Comp, Vienna, Austria.
N. Halko, P.G. Martinsson, and J.A. Tropp (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review: 53(2), pp. 217–288.
F. Prive, M. Blum, H. Aschard, B.J. Vilhjalmsson (2022). bigsnpr: Analysis of Massive SNP Arrays. https://cran.r-project.org/package=bigsnpr
Examples
require(locStra)
Computation of the k leading eigenvectors of the genomic relationship matrix, defined in Yang et al. (2011), directly from a bed+bim+fam file.
Description
Computation of the k leading eigenvectors of the genomic relationship matrix, defined in Yang et al. (2011), directly from a bed+bim+fam file.
Usage
bed_fastGrmEVs(f, k, robust = TRUE, q = 2)
Arguments
f |
The filename of the bed file (including its extension). The bim and fam files need to be in the same folder and have the same base filename. |
k |
The number of leading eigenvectors. |
robust |
Flag to indicate if the classic ( |
q |
The number of power iteration steps (default is |
Value
The k leading eigenvectors of the genomic relationship matrix of m
as a column matrix.
References
Yang J, Lee SH, Goddard ME, Visscher PM (2011). GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet, 88(1):76-82.
N. Halko, P.G. Martinsson, and J.A. Tropp (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review: 53(2), pp. 217–288.
F. Prive, M. Blum, H. Aschard, B.J. Vilhjalmsson (2022). bigsnpr: Analysis of Massive SNP Arrays. https://cran.r-project.org/package=bigsnpr
Examples
require(locStra)
Computation of the k leading eigenvectors of the Jaccard similarity matrix directly from a bed+bim+fam file.. Note that this computation is only approximate and does not necessarily coincide with the result obtained by extracting the k leading eigenvectors of the Jaccard matrix computed with the function jaccardMatrix
.
Description
Computation of the k leading eigenvectors of the Jaccard similarity matrix directly from a bed+bim+fam file.. Note that this computation is only approximate and does not necessarily coincide with the result obtained by extracting the k leading eigenvectors of the Jaccard matrix computed with the function jaccardMatrix
.
Usage
bed_fastJaccardEVs(f, k, q = 2)
Arguments
f |
The filename of the bed file (including its extension). The bim and fam files need to be in the same folder and have the same base filename. |
k |
The number of leading eigenvectors. |
q |
The number of power iteration steps (default is |
Value
The k leading eigenvectors of the Jaccard matrix of m
as a column matrix.
References
Dmitry Prokopenko, Julian Hecker, Edwin Silverman, Marcello Pagano, Markus Noethen, Christian Dina, Christoph Lange and Heide Fier (2016). Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics, 32(9):1366-1372.
N. Halko, P.G. Martinsson, and J.A. Tropp (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review: 53(2), pp. 217–288.
F. Prive, M. Blum, H. Aschard, B.J. Vilhjalmsson (2022). bigsnpr: Analysis of Massive SNP Arrays. https://cran.r-project.org/package=bigsnpr
Examples
require(locStra)
Computation of the k leading eigenvectors of the s-matrix (the weighted Jaccard similarity matrix) directly from a bed+bim+fam file. Note that in contrast to the parameters of the function sMatrix
, the choice phased=FALSE
cannot be modified for the fast eigenvector computation. Moreover, inverting the minor allele is not possible when reading directly from external files.
Description
Computation of the k leading eigenvectors of the s-matrix (the weighted Jaccard similarity matrix) directly from a bed+bim+fam file. Note that in contrast to the parameters of the function sMatrix
, the choice phased=FALSE
cannot be modified for the fast eigenvector computation. Moreover, inverting the minor allele is not possible when reading directly from external files.
Usage
bed_fastSMatrixEVs(f, k, Djac = FALSE, q = 2)
Arguments
f |
The filename of the bed file (including its extension). The bim and fam files need to be in the same folder and have the same base filename. |
k |
The number of leading eigenvectors. |
Djac |
Flag to switch between the unweighted ( |
q |
The number of power iteration steps (default is |
Value
The k leading eigenvectors of the s-matrix of m
as a column matrix.
References
Daniel Schlauch (2016). Implementation of the stego algorithm - Similarity Test for Estimating Genetic Outliers. https://github.com/dschlauch/stego
N. Halko, P.G. Martinsson, and J.A. Tropp (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review: 53(2), pp. 217–288.
F. Prive, M. Blum, H. Aschard, B.J. Vilhjalmsson (2022). bigsnpr: Analysis of Massive SNP Arrays. https://cran.r-project.org/package=bigsnpr
Examples
require(locStra)
C++ implementation to compute the covariance matrix for a (sparse) input matrix. The function is equivalent to the R command 'cov' applied to matrices.
Description
C++ implementation to compute the covariance matrix for a (sparse) input matrix. The function is equivalent to the R command 'cov' applied to matrices.
Usage
covMatrix(m, useCpp = TRUE, sparse = TRUE)
Arguments
m |
A (sparse) matrix for which the covariance matrix is sought. The input matrix is assumed to be oriented to contain the data for one individual per column. |
useCpp |
Flag to switch between R or C++ implementations. Default is |
sparse |
Flag to switch between purpose-built dense or sparse implementations. Default is |
Value
The covariance matrix of m
.
References
R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Stat Comp, Vienna, Austria.
Examples
require(locStra)
require(Matrix)
m <- matrix(sample(0:1,15,replace=TRUE),ncol=3)
sparseM <- Matrix(m,sparse=TRUE)
print(covMatrix(sparseM))
Computation of the k leading eigenvectors of the covariance matrix for a (sparse) input matrix.
Description
Computation of the k leading eigenvectors of the covariance matrix for a (sparse) input matrix.
Usage
fastCovEVs(m, k, useCpp = TRUE, sparse = TRUE, q = 2)
Arguments
m |
A (sparse) matrix for which the eigenvectors of its covariance matrix are sought. The input matrix is assumed to be oriented to contain the data for one individual per column. |
k |
The number of leading eigenvectors. |
useCpp |
Flag to switch between R or C++ implementations. Default is |
sparse |
Flag to switch between purpose-built dense or sparse implementations. Default is |
q |
The number of power iteration steps (default is |
Value
The k leading eigenvectors of the covariance matrix of m
as a column matrix.
References
R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Stat Comp, Vienna, Austria.
N. Halko, P.G. Martinsson, and J.A. Tropp (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review: 53(2), pp. 217–288.
Examples
require(locStra)
require(Matrix)
m <- matrix(sample(0:1,100,replace=TRUE),ncol=5)
sparseM <- Matrix(m,sparse=TRUE)
print(fastCovEVs(sparseM,k=2,useCpp=FALSE))
Computation of the k leading eigenvectors of the genomic relationship matrix, defined in Yang et al. (2011), for a (sparse) input matrix.
Description
Computation of the k leading eigenvectors of the genomic relationship matrix, defined in Yang et al. (2011), for a (sparse) input matrix.
Usage
fastGrmEVs(m, k, useCpp = TRUE, sparse = TRUE, robust = TRUE, q = 2)
Arguments
m |
A (sparse) matrix for which the eigenvectors of its genomic relationship matrix are sought. The input matrix is assumed to be oriented to contain the data for one individual per column. |
k |
The number of leading eigenvectors. |
useCpp |
Flag to switch between R or C++ implementations. Default is |
sparse |
Flag to switch between purpose-built dense or sparse implementations. Default is |
robust |
Flag to indicate if the classic ( |
q |
The number of power iteration steps (default is |
Value
The k leading eigenvectors of the genomic relationship matrix of m
as a column matrix.
References
Yang J, Lee SH, Goddard ME, Visscher PM (2011). GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet, 88(1):76-82.
N. Halko, P.G. Martinsson, and J.A. Tropp (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review: 53(2), pp. 217–288.
Examples
require(locStra)
require(Matrix)
m <- matrix(sample(0:1,100,replace=TRUE),ncol=5)
sparseM <- Matrix(m,sparse=TRUE)
print(fastGrmEVs(sparseM,k=2,useCpp=FALSE))
Computation of the k leading eigenvectors of the Jaccard similarity matrix for a (sparse) input matrix. Note that this computation is only approximate and does not necessarily coincide with the result obtained by extracting the k leading eigenvectors of the Jaccard matrix computed with the function jaccardMatrix
.
Description
Computation of the k leading eigenvectors of the Jaccard similarity matrix for a (sparse) input matrix. Note that this computation is only approximate and does not necessarily coincide with the result obtained by extracting the k leading eigenvectors of the Jaccard matrix computed with the function jaccardMatrix
.
Usage
fastJaccardEVs(m, k, useCpp = TRUE, sparse = TRUE, q = 2)
Arguments
m |
A (sparse) matrix for which the eigenvectors of its Jaccard matrix are sought. The input matrix is assumed to be oriented to contain the data for one individual per column. |
k |
The number of leading eigenvectors. |
useCpp |
Flag to switch between R or C++ implementations. Default is |
sparse |
Flag to switch between purpose-built dense or sparse implementations. Default is |
q |
The number of power iteration steps (default is |
Value
The k leading eigenvectors of the Jaccard matrix of m
as a column matrix.
References
Dmitry Prokopenko, Julian Hecker, Edwin Silverman, Marcello Pagano, Markus Noethen, Christian Dina, Christoph Lange and Heide Fier (2016). Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics, 32(9):1366-1372.
N. Halko, P.G. Martinsson, and J.A. Tropp (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review: 53(2), pp. 217–288.
Examples
require(locStra)
require(Matrix)
m <- matrix(sample(0:1,100,replace=TRUE),ncol=5)
sparseM <- Matrix(m,sparse=TRUE)
print(fastJaccardEVs(sparseM,k=2,useCpp=FALSE))
Computation of the k leading eigenvectors of the s-matrix (the weighted Jaccard similarity matrix) for a (sparse) input matrix. Note that in contrast to the parameters of the function sMatrix
, the choice phased=FALSE
cannot be modified for the fast eigenvector computation.
Description
Computation of the k leading eigenvectors of the s-matrix (the weighted Jaccard similarity matrix) for a (sparse) input matrix. Note that in contrast to the parameters of the function sMatrix
, the choice phased=FALSE
cannot be modified for the fast eigenvector computation.
Usage
fastSMatrixEVs(m, k, useCpp = TRUE, sparse = TRUE, Djac = FALSE, q = 2)
Arguments
m |
A (sparse) matrix for which the eigenvectors of its s-matrix are sought. The input matrix is assumed to be oriented to contain the data for one individual per column. |
k |
The number of leading eigenvectors. |
useCpp |
Flag to switch between R or C++ implementations. Default is |
sparse |
Flag to switch between purpose-built dense or sparse implementations. Default is |
Djac |
Flag to switch between the unweighted ( |
q |
The number of power iteration steps (default is |
Value
The k leading eigenvectors of the s-matrix of m
as a column matrix.
References
Daniel Schlauch (2016). Implementation of the stego algorithm - Similarity Test for Estimating Genetic Outliers. https://github.com/dschlauch/stego
N. Halko, P.G. Martinsson, and J.A. Tropp (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review: 53(2), pp. 217–288.
Examples
require(locStra)
require(Matrix)
m <- matrix(sample(0:1,100,replace=TRUE),ncol=5)
sparseM <- Matrix(m,sparse=TRUE)
print(fastSMatrixEVs(sparseM,k=2,useCpp=FALSE))
A full scan of the input data m
using a collection of windows given by the two-column matrix windows
. For each window, the data is processed using the function matrixFunction
(this could be, e.g., the covMatrix
function), then the processed data is summarized using the function summaryFunction
(e.g., the largest eigenvector computed with the function powerMethod
), and finally the global and local summaries are compared using the function comparisonFunction
(e.g., the vector correlation with R's function cor
). The function returns a two-column matrix which contains per row the global summary statistics (e.g., the correlation between the global and local eigenvectors) and the local summary statistics (e.g., the correlation between the local eigenvectors of the previous and current windows) for each window.
Description
A full scan of the input data m
using a collection of windows given by the two-column matrix windows
. For each window, the data is processed using the function matrixFunction
(this could be, e.g., the covMatrix
function), then the processed data is summarized using the function summaryFunction
(e.g., the largest eigenvector computed with the function powerMethod
), and finally the global and local summaries are compared using the function comparisonFunction
(e.g., the vector correlation with R's function cor
). The function returns a two-column matrix which contains per row the global summary statistics (e.g., the correlation between the global and local eigenvectors) and the local summary statistics (e.g., the correlation between the local eigenvectors of the previous and current windows) for each window.
Usage
fullscan(m, windows, matrixFunction, summaryFunction, comparisonFunction)
Arguments
m |
A (sparse) matrix for which the full scan is sought. The input matrix is assumed to be oriented to contain the data for one individual per column. |
windows |
A two-column matrix containing per column the windows on which the data is scanned. The windows can be overlapping. The windows can be computed using the function |
matrixFunction |
Function on one matrix argument to process the data for each window (e.g., the covariance matrix). |
summaryFunction |
Function on one argument to summarize the output of the function |
comparisonFunction |
Function on two inputs to compute a comparison measure for the output of the function |
Value
A two-column matrix containing per row the global and local summary statistics for each window. Plotting the correlation data of the returned matrix gives a figure analogously to the figure shown here, which was generated with the example code below.
References
Dmitry Prokopenko, Julian Hecker, Edwin Silverman, Marcello Pagano, Markus Noethen, Christian Dina, Christoph Lange and Heide Fier (2016). Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics, 32(9):1366-1372.
Examples
require(locStra)
require(Matrix)
data(testdata)
cor2 <- function(x,y) ifelse(sum(x)==0 | sum(y)==0, 0, cor(x,y))
windowSize <- 10000
w <- makeWindows(nrow(testdata),windowSize,windowSize)
resCov <- fullscan(testdata,w,covMatrix,powerMethod,cor2)
resJac <- fullscan(testdata,w,jaccardMatrix,powerMethod,cor2)
resSMx <- fullscan(testdata,w,sMatrix,powerMethod,cor2)
resGRM <- fullscan(testdata,w,grMatrix,powerMethod,cor2)
resAll <- cbind(resCov[,1], resJac[,1], resSMx[,1], resGRM[,1])
xlabel <- "SNP position"
ylabel <- "correlation between global and local eigenvectors"
mainlabel <- paste("window size",windowSize)
matplot(w[,1],abs(resAll),type="b",xlab=xlabel,ylab=ylabel,ylim=c(0,1),main=mainlabel)
legend("topright",legend=c("Cov","Jaccard","s-Matrix","GRM"),pch=paste(1:ncol(resAll)))
C++ implementation to compute the genomic relationship matrix (grm) for a (sparse) input matrix as defined in Yang et al. (2011).
Description
C++ implementation to compute the genomic relationship matrix (grm) for a (sparse) input matrix as defined in Yang et al. (2011).
Usage
grMatrix(m, useCpp = TRUE, sparse = TRUE, robust = TRUE)
Arguments
m |
A (sparse) matrix for which the genomic relationship matrix is sought. The input matrix is assumed to be oriented to contain the data for one individual per column. |
useCpp |
Flag to switch between R or C++ implementations. Default is |
sparse |
Flag to switch between purpose-built dense or sparse implementations. Default is |
robust |
Flag to indicate if the classic ( |
Value
The genomic relationship matrix of m
.
References
Yang J, Lee SH, Goddard ME, Visscher PM (2011). GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet, 88(1):76-82.
Examples
require(locStra)
require(Matrix)
m <- matrix(sample(0:1,15,replace=TRUE),ncol=3)
sparseM <- Matrix(m,sparse=TRUE)
print(grMatrix(sparseM))
C++ implementation to compute the Jaccard similarity matrix for a (sparse) input matrix.
Description
C++ implementation to compute the Jaccard similarity matrix for a (sparse) input matrix.
Usage
jaccardMatrix(m, useCpp = TRUE, sparse = TRUE)
Arguments
m |
A (sparse) matrix for which the Jaccard similarity matrix is sought. The input matrix is assumed to be oriented to contain the data for one individual per column. |
useCpp |
Flag to switch between R or C++ implementations. Default is |
sparse |
Flag to switch between purpose-built dense or sparse implementations. Default is |
Value
The Jaccard matrix of m
.
References
Dmitry Prokopenko, Julian Hecker, Edwin Silverman, Marcello Pagano, Markus Noethen, Christian Dina, Christoph Lange and Heide Fier (2016). Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics, 32(9):1366-1372.
Examples
require(locStra)
require(Matrix)
m <- matrix(sample(0:1,15,replace=TRUE),ncol=3)
sparseM <- Matrix(m,sparse=TRUE)
print(jaccardMatrix(sparseM))
Auxiliary function to generate a two-column matrix of windows to be used in the function 'fullscan'.
Description
Auxiliary function to generate a two-column matrix of windows to be used in the function 'fullscan'.
Usage
makeWindows(len, size, offset)
Arguments
len |
The overall length of the data which is to be scanned in windows. |
size |
The window size. |
offset |
The offset of the generated windows (e.g., if |
Value
A two-column matrix of sliding windows, with one window per row defined through start and end value.
Examples
require(locStra)
print(makeWindows(100,10,5))
C++ implementation of the power method (von Mises iteration) to compute the largest eigenvector of a dense input matrix.
Description
C++ implementation of the power method (von Mises iteration) to compute the largest eigenvector of a dense input matrix.
Usage
powerMethod(m, initvector = 0)
Arguments
m |
Symmetric matrix for which the largest eigenvector is sought. |
initvector |
Optional vector compatible with the input matrix which serves as a starting value for the iteration. Default is zero. |
Value
The largest eigenvector of m
.
References
Richard von Mises and Hilda Pollaczek-Geiringer (1929). Praktische Verfahren der Gleichungsaufloesung. ZAMM Zeitschrift fuer Angewandte Mathematik und Mechanik, 9:152-164.
Examples
require(locStra)
m <- matrix(1:9,3)
print(powerMethod(m))
C++ implementation to compute the s-matrix (the weighted Jaccard similarity matrix) for a (sparse) input matrix as in the 'Stego' package: https://github.com/dschlauch/stego
Description
C++ implementation to compute the s-matrix (the weighted Jaccard similarity matrix) for a (sparse) input matrix as in the 'Stego' package: https://github.com/dschlauch/stego
Usage
sMatrix(m, useCpp = TRUE, sparse = TRUE, Djac = FALSE, phased = FALSE)
Arguments
m |
A (sparse) matrix for which the s-matrix is sought. The input matrix is assumed to be oriented to contain the data for one individual per column. |
useCpp |
Flag to switch between R or C++ implementations. Default is |
sparse |
Flag to switch between purpose-built dense or sparse implementations. Default is |
Djac |
Flag to switch between the unweighted ( |
phased |
Boolean flag to indicate if the input matrix is phased. Default is |
Value
The s-matrix (the weighted Jaccard matrix) of m
.
References
Daniel Schlauch (2016). Implementation of the stego algorithm - Similarity Test for Estimating Genetic Outliers. https://github.com/dschlauch/stego
Examples
require(locStra)
require(Matrix)
m <- matrix(sample(0:1,15,replace=TRUE),ncol=3)
sparseM <- Matrix(m,sparse=TRUE)
print(sMatrix(sparseM))
Auxiliary function to invert minor alleles and to select those variants/loci exceeding a minimal cutoff value.
Description
Auxiliary function to invert minor alleles and to select those variants/loci exceeding a minimal cutoff value.
Usage
selectVariants(m, phased = FALSE, invertMinorAllele = TRUE, minVariants = 0)
Arguments
m |
A (sparse) input matrix. The input matrix is assumed to be oriented to contain the data for one individual per column. |
phased |
Boolean flag to indicate if the input matrix is phased. Default is |
invertMinorAllele |
Boolean flag to indicate if the minor allele should be inverted. Default is |
minVariants |
Cutoff value for minimal number of variants for keeping a locus. Default is |
Value
The processed matrix with pruned variants/loci.
Examples
require(locStra)
m <- matrix(sample(0:1,100,replace=TRUE),ncol=10)
print(selectVariants(m))
Simulated test data.
Description
An artificial dataset containing 100,000 RVs for 100 subjects (one per column). The dataset was generated by resampling the data for chromosome 1 of the 1,000 Genome Project with replacement for randomly chosen subjects.
Usage
data(testdata)
Format
A matrix with 100,000 rows and 100 columns.
References
A global reference for human genetic variation, The 1000 Genomes Project Consortium, Nature 526, 68-74 (01 October 2015) doi:10.1038/nature15393.