The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
genio
The genio
(GenIO = Genetics I/O) package aims to facilitate reading and writing genetics data. The focus of this vignette is processing Plink BED/BIM/FAM files.
There are some limited alternatives for reading and/or writing BED files in R, which are slower and harder to use, which motivated me to write this package. Here we make direct comparisons to those packages, to illustrate the advantages of genio
.
Let’s begin by creating a large genotype matrix with completely random data, and with missing values so this becomes non-trivial to read and write.
# Data dimensions.
# Choose non-multiples of 4 to test edge cases of BED parsers.
# Number of loci.
<- 10001
m_loci # Number of individuals
<- 1001
n_ind # Overall allele frequency
# (we don't do any actual genetics in this vignette,
# so just set to a reasonable value)
<- 0.5
p # Missingness rate
<- 0.1
miss
# Total number of genotypes
<- n_ind * m_loci
n_data # Draw random genotypes from Binomial
<- rbinom( n_data, 2, p)
X # Add missing values
sample(n_data, n_data * miss) ] <- NA
X[ # Turn into matrix
<- matrix(X, nrow = m_loci, ncol = n_ind)
X
# Inspect the first 10 individuals at the first 10 loci
1:10, 1:10]
X[#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 1 1 1 1 2 2 NA 1 2 1
#> [2,] 0 2 0 0 0 1 1 0 2 1
#> [3,] 1 1 1 1 0 0 2 2 1 1
#> [4,] NA 2 NA 1 1 2 0 2 1 NA
#> [5,] 0 0 NA 1 0 1 2 1 NA 1
#> [6,] 0 1 1 1 NA 1 1 1 2 1
#> [7,] 0 0 2 2 1 1 0 NA NA 2
#> [8,] 1 2 1 2 1 2 1 1 0 1
#> [9,] 2 1 1 NA 1 0 2 0 NA 1
#> [10,] 0 0 0 1 2 1 2 2 2 1
To create annotation tables that look a bit more interesting than the defaults, let us create some slightly more realistic values. Here we use our first genio
function!
library(genio)
First we create and edit the locus annotations table.
# We have to specify the number of loci
<- make_bim( n = m_loci )
bim
# Inspect the default values
bim#> # A tibble: 10,001 × 6
#> chr id posg pos alt ref
#> <dbl> <int> <dbl> <int> <dbl> <dbl>
#> 1 1 1 0 1 2 1
#> 2 1 2 0 2 2 1
#> 3 1 3 0 3 2 1
#> 4 1 4 0 4 2 1
#> 5 1 5 0 5 2 1
#> 6 1 6 0 6 2 1
#> 7 1 7 0 7 2 1
#> 8 1 8 0 8 2 1
#> 9 1 9 0 9 2 1
#> 10 1 10 0 10 2 1
#> # … with 9,991 more rows
# Let's add the "chr" prefix to the chromosome values,
# so we recognize them when we see them later.
$chr <- paste0('chr', bim$chr)
bim# Make SNP IDs look like "rs" IDs
$id <- paste0('rs', bim$id)
bim# Make positions 1000 bigger
$pos <- bim$pos * 1000
bim# Select randomly between Cs and Gs for the reference alleles
$ref <- sample(c('C', 'G'), m_loci, replace = TRUE)
bim# Select randomly between As and Ts for the alternative alleles
$alt <- sample(c('A', 'T'), m_loci, replace = TRUE)
bim
# Inspect the table with our changes
bim#> # A tibble: 10,001 × 6
#> chr id posg pos alt ref
#> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 chr1 rs1 0 1000 T C
#> 2 chr1 rs2 0 2000 A C
#> 3 chr1 rs3 0 3000 T C
#> 4 chr1 rs4 0 4000 T G
#> 5 chr1 rs5 0 5000 T G
#> 6 chr1 rs6 0 6000 A C
#> 7 chr1 rs7 0 7000 T G
#> 8 chr1 rs8 0 8000 A G
#> 9 chr1 rs9 0 9000 A C
#> 10 chr1 rs10 0 10000 T C
#> # … with 9,991 more rows
Now we similarly create and edit the individual annotations table.
# Specify the number of individuals
<- make_fam( n = n_ind )
fam
# Inspect the default values
fam#> # A tibble: 1,001 × 6
#> fam id pat mat sex pheno
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 0 0 0 0
#> 2 2 2 0 0 0 0
#> 3 3 3 0 0 0 0
#> 4 4 4 0 0 0 0
#> 5 5 5 0 0 0 0
#> 6 6 6 0 0 0 0
#> 7 7 7 0 0 0 0
#> 8 8 8 0 0 0 0
#> 9 9 9 0 0 0 0
#> 10 10 10 0 0 0 0
#> # … with 991 more rows
# Add prefixes to families and IDs to recognize them later
$fam <- paste0('fam', fam$fam)
fam$id <- paste0('id', fam$id)
fam# Sex values are usually 1 and 2
$sex <- sample(1:2, n_ind, replace = TRUE)
fam# Let's make phenotypes continuous.
# Draw independently from Standard Normal.
$pheno <- rnorm(n_ind)
fam# Let's leave maternal and paternal IDs as missing
# Inspect again
fam#> # A tibble: 1,001 × 6
#> fam id pat mat sex pheno
#> <chr> <chr> <dbl> <dbl> <int> <dbl>
#> 1 fam1 id1 0 0 1 0.150
#> 2 fam2 id2 0 0 1 1.09
#> 3 fam3 id3 0 0 2 -0.0669
#> 4 fam4 id4 0 0 1 -0.968
#> 5 fam5 id5 0 0 2 -1.86
#> 6 fam6 id6 0 0 1 -1.31
#> 7 fam7 id7 0 0 2 0.0666
#> 8 fam8 id8 0 0 2 -0.234
#> 9 fam9 id9 0 0 2 -0.187
#> 10 fam10 id10 0 0 2 -2.09
#> # … with 991 more rows
Lastly, let’s copy the locus and individual IDs as row and column names of the genotype matrix, respectively. Although this step is not required, it is encouraged for consistency (if present, values are checked when writing files).
# Add column and row names from bim and fam tables we just created.
rownames(X) <- bim$id
colnames(X) <- fam$id
# Inspect again the first 10 individuals and loci
1:10, 1:10]
X[#> id1 id2 id3 id4 id5 id6 id7 id8 id9 id10
#> rs1 1 1 1 1 2 2 NA 1 2 1
#> rs2 0 2 0 0 0 1 1 0 2 1
#> rs3 1 1 1 1 0 0 2 2 1 1
#> rs4 NA 2 NA 1 1 2 0 2 1 NA
#> rs5 0 0 NA 1 0 1 2 1 NA 1
#> rs6 0 1 1 1 NA 1 1 1 2 1
#> rs7 0 0 2 2 1 1 0 NA NA 2
#> rs8 1 2 1 2 1 2 1 1 0 1
#> rs9 2 1 1 NA 1 0 2 0 NA 1
#> rs10 0 0 0 1 2 1 2 2 2 1
Let’s write this random data to a file. This mode is intended for simulated data, generating dummy BIM and FAM files in the process and writing out all three.
# Will delete at the end of the vignette
<- tempfile('vignette-random-data')
file_plink
# Write genotypes, along with the BIM and FAM files we created.
# Omiting them would result in writing the original dummy version of these tables,
# before we edited them.
<- system.time(
time_write_genio write_plink(file_plink, X, bim, fam)
)#> Writing: /tmp/RtmpLWRGh3/vignette-random-data71391bd2804.bed
#> Writing: /tmp/RtmpLWRGh3/vignette-random-data71391bd2804.bim
#> Writing: /tmp/RtmpLWRGh3/vignette-random-data71391bd2804.fam
time_write_genio#> user system elapsed
#> 0.206 0.090 0.261
Here we demonstrate how easy it is to read the data back. To compare to other packages, we shall time loading all this data.
# Read the data back in memory.
# Time this step
<- system.time(
time_read_genio <- read_plink(file_plink)
data_genio
)#> Reading: /tmp/RtmpLWRGh3/vignette-random-data71391bd2804.bim
#> Reading: /tmp/RtmpLWRGh3/vignette-random-data71391bd2804.fam
#> Reading: /tmp/RtmpLWRGh3/vignette-random-data71391bd2804.bed
time_read_genio#> user system elapsed
#> 0.082 0.011 0.093
# Inspect the data we just read back
# The same random genotypes (first 10 individuals and loci, now with row and column names):
$X[1:10, 1:10]
data_genio#> id1 id2 id3 id4 id5 id6 id7 id8 id9 id10
#> rs1 1 1 1 1 2 2 NA 1 2 1
#> rs2 0 2 0 0 0 1 1 0 2 1
#> rs3 1 1 1 1 0 0 2 2 1 1
#> rs4 NA 2 NA 1 1 2 0 2 1 NA
#> rs5 0 0 NA 1 0 1 2 1 NA 1
#> rs6 0 1 1 1 NA 1 1 1 2 1
#> rs7 0 0 2 2 1 1 0 NA NA 2
#> rs8 1 2 1 2 1 2 1 1 0 1
#> rs9 2 1 1 NA 1 0 2 0 NA 1
#> rs10 0 0 0 1 2 1 2 2 2 1
# The locus annotations
$bim
data_genio#> # A tibble: 10,001 × 6
#> chr id posg pos alt ref
#> <chr> <chr> <dbl> <int> <chr> <chr>
#> 1 chr1 rs1 0 1000 T C
#> 2 chr1 rs2 0 2000 A C
#> 3 chr1 rs3 0 3000 T C
#> 4 chr1 rs4 0 4000 T G
#> 5 chr1 rs5 0 5000 T G
#> 6 chr1 rs6 0 6000 A C
#> 7 chr1 rs7 0 7000 T G
#> 8 chr1 rs8 0 8000 A G
#> 9 chr1 rs9 0 9000 A C
#> 10 chr1 rs10 0 10000 T C
#> # … with 9,991 more rows
# The individual annotations
$fam
data_genio#> # A tibble: 1,001 × 6
#> fam id pat mat sex pheno
#> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 fam1 id1 0 0 1 0.150
#> 2 fam2 id2 0 0 1 1.09
#> 3 fam3 id3 0 0 2 -0.0669
#> 4 fam4 id4 0 0 1 -0.968
#> 5 fam5 id5 0 0 2 -1.86
#> 6 fam6 id6 0 0 1 -1.31
#> 7 fam7 id7 0 0 2 0.0666
#> 8 fam8 id8 0 0 2 -0.234
#> 9 fam9 id9 0 0 2 -0.187
#> 10 fam10 id10 0 0 2 -2.09
#> # … with 991 more rows
# Quickly test that the inputs and outputs are identical.
# Genotypes have NAs, so we have to compare this way.
stopifnot( all( X == data_genio$X, na.rm = TRUE) )
stopifnot( bim == data_genio$bim )
# FAM has mixed content (chars, ints, and doubles).
# First 5 columns should be exact match:
stopifnot( fam[,1:5] == data_genio$fam[,1:5] )
# Exact equality may fail for pheno due to precisions, so test this way instead:
stopifnot( max(abs(fam$pheno - data_genio$fam$pheno)) < 1e-4 )
One drawback of genio
and other related approaches is high memory consumption (see more on that at the end of this vignette). It is entirely possible that an attempt to parse a genotype matrix into memory will fail with an “out of memory” error message. Let’s estimate memory usage here.
Genotypes are stored as integers by genio
, which in R on a modern machine (64 bit architecture) consumes 4 bytes! So an \(n \times m\) matrix takes up at least \(4 n m\) bytes (an R matrix contains an additional constant overhead, which does not depend on \(n,m\) and is relatively small for large matrices).
To get an idea of how much this is, let’s assume a standard genotyping array, where \(m\) is about half a million. In this scenario, we get a simple rule of thumb that every 1000 individuals consume a little less than 2G:
# Constants
<- 4
bytes_per_genotype <- 1024 ^ 3
bytes_per_gb # Example data dimensions
<- 1000
num_ind <- 500000
num_loci # Gigabytes per 1000 individuals for a typical genotyping array
* num_ind * num_loci / bytes_per_gb
bytes_per_genotype #> [1] 1.862645
This is the amount of free memory required just to load the matrix. Several common matrix operations for genotypes consume even more memory, so more free memory will be needed to accomplish anything.
There are several R packages that provide BED readers, and to a lesser extent some of the other functionality of genio
. Here we compare directly to BEDMatrix
and snpStats
. Each of these packages has different goals so they are optimized for their use cases. The genio
parser is optimized to read the entire genotype data into a native R matrix, so that it is easy to inspect and manipulate for R beginners. Here we demonstrate that genio
is not only the fastest at this task, but also the easiest to obtain in terms of requiring the least amount of coding.
The BEDMatrix
package allows access to genotypes from a BED file without loading the entire matrix in memory. This is very convenient for large datasets, as long as only small portions of the data are pulled at once. However, there are some disadvantages:
BEDMatrix
return object is not a regular R matrix, which confuses some users.BEDMatrix
package provides no way to access the full annotation tables (BIM and FAM files).
BEDMatrix
does not have any write functions.Here is an example of that usage and how the data compares. Note in particular that the BEDMatrix
representation of the genotype matrix is transposed compared to the genio
matrix:
library(BEDMatrix)
# Time it too.
# Although the BIM and FAM tables are not returned,
# they are partially parsed and kept in memory,
# which can take time for extremely large files
<- system.time(
time_read_bedmatrix_1 <- BEDMatrix(file_plink)
X_BEDMatrix
)#> Extracting number of samples and rownames from vignette-random-data71391bd2804.fam...
#> Extracting number of variants and colnames from vignette-random-data71391bd2804.bim...
# Inspect the first 10 loci and individuals as usual.
# Note it is transposed compared to our X.
# Also note the column and row names are different from genio's.
1:10, 1:10]
X_BEDMatrix[#> rs1_T rs2_A rs3_T rs4_T rs5_T rs6_A rs7_T rs8_A rs9_A rs10_T
#> fam1_id1 1 0 1 NA 0 0 0 1 2 0
#> fam2_id2 1 2 1 2 0 1 0 2 1 0
#> fam3_id3 1 0 1 NA NA 1 2 1 1 0
#> fam4_id4 1 0 1 1 1 1 2 2 NA 1
#> fam5_id5 2 0 0 1 0 NA 1 1 1 2
#> fam6_id6 2 1 0 2 1 1 1 2 0 1
#> fam7_id7 NA 1 2 0 2 1 0 1 2 2
#> fam8_id8 1 0 2 2 1 1 NA 1 0 2
#> fam9_id9 2 2 1 1 NA 2 NA 0 NA 2
#> fam10_id10 1 1 1 NA 1 1 2 1 1 1
Therefore, for very large datasets, if it suffices to access the genotype data in small portions and the user is willing to deal with the limitations of the object, and no detailed annotation table information is required, then BEDMatrix
is a better solution than the read_bed
function provided in the genio
package.
To compare the two genotype reading functions on an equal footing, let us assume that the entire genotype matrix is required in memory and stored in a regular R matrix.
# This turns it into a regular R matrix.
# Since most of the reading is actually happening now,
# we time this step now.
<- system.time(
time_read_bedmatrix_2 <- as.matrix(X_BEDMatrix)
X_BEDMatrix_Rmat
)
time_read_bedmatrix_2#> user system elapsed
#> 0.057 0.012 0.069
# Now we can test that the BEDMatrix agrees with the original matrix we simulated.
# Need to transpose first.
stopifnot( all( X == t(X_BEDMatrix_Rmat), na.rm = TRUE) )
The snpStats
package is the most fully-featured alternative to genio
. One of its advantages is its memory efficiency, encoding the entire data in less memory than a native R matrix. However, this memory efficiency comes at the cost of making the data harder to access, especially for inexperienced R users. Like genio
, and in contrast to the other packages, snpStats
also reads the BIM and FAM tables fully, and provides a BED writer, but it is considerably harder to use.
First we illustrate data parsing. The annotation tables are similar but column names are different, and certain missing values (zero in text files) are converted to NA
instead.
library(snpStats)
#> Loading required package: survival
#> Loading required package: Matrix
# Read data, time it.
<- system.time(
time_read_snpStats_1 <- read.plink(file_plink)
data_snpStats
)
time_read_snpStats_1#> user system elapsed
#> 0.061 0.003 0.064
# Inspect the data
# Genotypes.
# Note data is not visualized this way.
# This matrix is also transposed compared to the genio matrix.
$genotypes
data_snpStats#> A SnpMatrix with 1001 rows and 10001 columns
#> Row names: fam1 ... fam1001
#> Col names: rs1 ... rs10001
# Locus annotations
head( data_snpStats$map )
#> chromosome snp.name cM position allele.1 allele.2
#> rs1 chr1 rs1 NA 1000 T C
#> rs2 chr1 rs2 NA 2000 A C
#> rs3 chr1 rs3 NA 3000 T C
#> rs4 chr1 rs4 NA 4000 T G
#> rs5 chr1 rs5 NA 5000 T G
#> rs6 chr1 rs6 NA 6000 A C
# Individual annotations
head (data_snpStats$fam )
#> pedigree member father mother sex affected
#> fam1 fam1 id1 NA NA 1 0.14973301
#> fam2 fam2 id2 NA NA 1 1.09398337
#> fam3 fam3 id3 NA NA 2 -0.06689038
#> fam4 fam4 id4 NA NA 1 -0.96776919
#> fam5 fam5 id5 NA NA 2 -1.85701341
#> fam6 fam6 id6 NA NA 1 -1.30706893
As for BEDMatrix
, assuming we ultimately desire to convert the entire data into a regular R matrix, an extra step is required:
# Transpose, then convert to a regular R matrix.
# Let's time this step too.
<- system.time(
time_read_snpStats_2 <- as( t(data_snpStats$genotypes), 'numeric')
X_snpStats
)
time_read_snpStats_2#> user system elapsed
#> 0.043 0.021 0.066
# Now we can visualize the matrix.
# First 10 loci and individuals, as before.
# Note that, compared to (genio, BEDMatrix), alleles are encoded in reverse,
# so 0s and 2s are flipped in this matrix.
1:10, 1:10]
X_snpStats[#> fam1 fam2 fam3 fam4 fam5 fam6 fam7 fam8 fam9 fam10
#> rs1 1 1 1 1 0 0 NA 1 0 1
#> rs2 2 0 2 2 2 1 1 2 0 1
#> rs3 1 1 1 1 2 2 0 0 1 1
#> rs4 NA 0 NA 1 1 0 2 0 1 NA
#> rs5 2 2 NA 1 2 1 0 1 NA 1
#> rs6 2 1 1 1 NA 1 1 1 0 1
#> rs7 2 2 0 0 1 1 2 NA NA 0
#> rs8 1 0 1 0 1 0 1 1 2 1
#> rs9 0 1 1 NA 1 2 0 2 NA 1
#> rs10 2 2 2 1 0 1 0 0 0 1
# Again verify that the matrices are identical.
# (Here 2-X flips back 0s and 2s)
stopifnot( all( X == 2 - X_snpStats, na.rm = TRUE) )
snpStats
is the only package other than genio
to provide a BED writer. Here we demonstrate how hard it is to use it to write our data.
# Let's write this to another file
<- tempfile('vignette-random-data-copy')
file_plink_copy
# Copy objects to not change originals
<- X
X_snpStats <- as.data.frame(bim) # to use rownames
bim_snpStats <- as.data.frame(fam) # ditto
fam_snpStats
# All data requires matching row and/or column names.
# These first two were already done above.
#rownames(X_snpStats) <- bim$id
#colnames(X_snpStats) <- fam$id
# Row names here are redundant but required.
rownames(bim_snpStats) <- bim$id
rownames(fam_snpStats) <- fam$id
# We shall time several required steps in order to write genotypes in a standard R matrix,
# and the related annotation tables, to BED.
<- system.time({
time_write_snpStats # Transpose and convert our genotypes to SnpMatrix object.
# We flip 0s and 2s before converting
<- as(2 - t(X_snpStats), 'SnpMatrix')
X_snpStats
# This complicated command is required to write the data.
# Although X, fam, and bim are passed as for genio's write_plink,
# here the name of every column must be specified (there are no reasonable defaults).
# Interestingly, the parameter names of snpStats' write.plink
# do not agree with read.plink from the same package.
write.plink(
file_plink_copy,snps = X_snpStats,
subject.data = fam_snpStats,
pedigree = fam,
id = id,
father = pat,
mother = mat,
sex = sex,
phenotype = pheno,
snp.data = bim_snpStats,
chromosome = chr,
genetic.distance = posg,
position = pos,
allele.1 = ref,
allele.2 = alt
)
})#> Writing FAM file to /tmp/RtmpLWRGh3/vignette-random-data-copy713914fef710.fam
#> Writing extended MAP file to /tmp/RtmpLWRGh3/vignette-random-data-copy713914fef710.bim
#> Writing BED file to /tmp/RtmpLWRGh3/vignette-random-data-copy713914fef710.bed (SNP-major mode)
# remove the new file, no longer need it
delete_files_plink(file_plink_copy)
Note that reading performance varies on different machines depending on the balance between hard drive access and processor speeds (where the relative bottleneck is). That being said, the genio
reader is consistently the fastest, if not a close second (see below for tests on your machine), for several reasons. Genotypes are read very efficiently using Rcpp
code, and stored directly into an R matrix, which is most efficient if that is the end goal. The annotation tables are also read most efficiently, using the readr
package internally.
The BEDMatrix
reader is also written in Rcpp
, but its main goal is to provide efficient random access to genotypes stored in a file, which makes obtaining an R
matrix more awkward, though surprisingly without paying a time penalty for it. The annotation tables are read with data.table
, which is actually the fastest, though the difference in performance is small compared to readr
for reasonably-sized files. However, BEDMatrix
does not process or return full annotation tables, which gives it an unfair advantage compared to genio
, as BEDMatrix
takes shortcuts to only read the columns it needs. If the annotation tables are needed, reading them will incur an additional (and in some cases considerable) time penalty.
The snpStats
reader is written in C
, so it is also very fast for its initial step, but converting the genotypes from the snpStats
format into a native R matrix proves too expensive. The annotation tables are read with the base function read.table
, which is also the slowest. In terms of completeness of output (full genotypes and annotations tables), only this package matches genio
, which makes it the fairest comparison.
# Extract component 3 of each time object,
# which is is total time elapsed.
# Sum the two steps it takes for each of BEDMatrix and snpStats to obtain a native R matrix.
<- c(
times_read 3],
time_read_genio[3] + time_read_bedmatrix_2[3],
time_read_bedmatrix_1[3] + time_read_snpStats_2[3]
time_read_snpStats_1[
)<- c(
names_read 'genio',
'BEDMatrix',
'snpStats'
)# Create barplot
barplot(
times_read,names.arg = names_read,
main = 'BED reader runtimes',
xlab = 'packages',
ylab = 'runtime (s)'
)
Now we compare the writers. Only snpStats
and genio
have a BED writer. Not only was the genio
writer much easier to use, it is also considerably faster.
<- c(
times_write 3],
time_write_genio[3]
time_write_snpStats[
)<- c(
names_write 'genio',
'snpStats'
)# Create barplot
barplot(
times_write,names.arg = names_write,
main = 'BED writer runtimes',
xlab = 'packages',
ylab = 'runtime (s)'
)
High memory consumption is the main sacrifice for the ease of use of native R matrices, and in this respect genio
consumes the most memory.
We use the pryr
package to compare the memory usage of each native genotype object.
library(lobstr)
# Store directly into a vector
<- c(
sizes obj_size( X ),
obj_size( data_genio$X ),
obj_size( X_BEDMatrix ),
obj_size( data_snpStats$genotypes )
)<- c(
names_sizes 'original',
'genio',
'BEDMatrix',
'snpStats'
)# Create barplot
barplot(
as.numeric( sizes ),
names.arg = names_sizes,
main = 'Native genotype object sizes',
xlab = 'packages',
ylab = 'memory (bytes)'
)
The BEDMatrix
package is ideal for very large files, since data is loaded into memory only as needed. So the main object does not actually hold any genotypes in memory. It is up to the user to decide how much data to access at once to trade-off speed and memory consumption.
snpStats
manages memory better than genio
, but strictly by a factor of 4. Each genotype is encoded as an integer in R
(in general) and genio
(in particular), which uses 4 bytes (this is actually platform dependent). On the other hand, snpStats
uses strictly one byte per genotype.
Overall, there is a narrow window in data sizes in which a genotype matrix is too large for a native R matrix but small enough for a snpStats
object. That, and the much more complex interface of snpStats
, makes it not very worthwhile in my opinion, unless there is need for functions provided only by that package. I prefer to use BEDMatrix
for large datasets.
This handy function removes the three BED/BIM/FAM files we generated at the beginning of this vignette.
delete_files_plink(file_plink)
Let’s close by showing the package versions used when this vignette was last built, as the implementations compared here could change in the future.
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: Fedora Linux 37 (Workstation Edition)
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.2
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] lobstr_1.1.2 snpStats_1.44.0 Matrix_1.5-3 survival_3.4-0
#> [5] BEDMatrix_2.0.3 genio_1.1.2
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.9 highr_0.10 bslib_0.4.2
#> [4] compiler_4.2.2 pillar_1.8.1 pryr_0.1.5
#> [7] jquerylib_0.1.4 tools_4.2.2 zlibbioc_1.40.0
#> [10] bit_4.0.5 crochet_2.3.0 digest_0.6.31
#> [13] jsonlite_1.8.4 evaluate_0.19 lifecycle_1.0.3
#> [16] tibble_3.1.8 lattice_0.20-45 pkgconfig_2.0.3
#> [19] rlang_1.0.6 cli_3.5.0 parallel_4.2.2
#> [22] yaml_2.3.6 xfun_0.36 fastmap_1.1.0
#> [25] stringr_1.5.0 knitr_1.41 hms_1.1.2
#> [28] vctrs_0.5.1 sass_0.4.4 tidyselect_1.2.0
#> [31] bit64_4.0.5 grid_4.2.2 data.table_1.14.6
#> [34] glue_1.6.2 R6_2.5.1 fansi_1.0.3
#> [37] vroom_1.6.0 rmarkdown_2.19 tzdb_0.3.0
#> [40] readr_2.1.3 magrittr_2.0.3 ellipsis_0.3.2
#> [43] codetools_0.2-18 htmltools_0.5.4 splines_4.2.2
#> [46] BiocGenerics_0.40.0 utf8_1.2.2 stringi_1.7.8
#> [49] cachem_1.0.6 crayon_1.5.2
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.