| Title: | S4 Tools for Reading and Organizing Genetic Data |
| Version: | 0.1.0 |
| Description: | Provides an integrated suite of tools for handling single nucleotide polymorphism (SNP) genotype data in large-scale genetic studies. Supports importing and merging genotype files, performing quality control on SNP markers and samples, and preparing data for downstream analyses using popular software such as 'FImpute' and 'PLINK'. Offers S4 classes and methods to efficiently encapsulate SNP data, along with utilities for generating genotype summary statistics and visualization. Additional functionalities include anticlustering approaches for batch effect control, automated script generation for external software, and streamlined workflows for large datasets commonly encountered in animal and plant breeding programs. Designed to facilitate reproducible and scalable SNP data analyses in quantitative and statistical genetics. |
| Depends: | R (≥ 4.1.0) |
| Imports: | methods, ggplot2, dplyr, data.table, Rcpp, stringi, anticlust, grDevices, graphics, stats, utils, MASS, snpStats, magrittr, reshape2 |
| LinkingTo: | Rcpp |
| Suggests: | knitr, rmarkdown |
| VignetteBuilder: | knitr |
| Encoding: | UTF-8 |
| License: | GPL-3 |
| URL: | https://viniciusjunqueira.github.io/SNPkit/, https://github.com/viniciusjunqueira/SNPkit |
| BugReports: | https://github.com/viniciusjunqueira/SNPkit/issues |
| RoxygenNote: | 7.3.3 |
| NeedsCompilation: | yes |
| Packaged: | 2026-06-22 15:01:21 UTC; viniciusjunqueira |
| Author: | Vinícius Junqueira [aut, cre], Roberto Higa [aut], Fernando Flores Cardoso [aut], Marcos Jun Iti Yokoo [aut] |
| Maintainer: | Vinícius Junqueira <junqueiravinicius@hotmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-26 09:40:08 UTC |
SNPkit: S4 tools for reading and organizing genetic data
Description
Utilities for reading, cleaning, summarizing, and modeling SNP genotype data.
Author(s)
Vinícius Junqueira junqueiravinicius@hotmail.com Roberto Higa roberto.higa@embrapa.br Fernando Flores Cardoso fernando.cardoso@embrapa.br Marcos Jun Iti Yokoo marcos.yokoo@embrapa.br
See Also
Useful links:
Report bugs at https://github.com/viniciusjunqueira/SNPkit/issues
FImputeExport Class
Description
A class to handle export preparation for FImpute.
Slots
genoA SnpMatrix or NULL containing genotype data.
mapA data.frame containing marker information.
pathOutput file path.
nameProject or file name.
Build FImputeRunner object
Description
A convenience function to construct a 'FImputeRunner' object from a 'SNPDataLong' object.
Usage
FImputeRunner(object, path, exec_path = "FImpute3", name = "data")
Arguments
object |
An object of class 'SNPDataLong', from which 'geno' and 'map' slots will be extracted. |
path |
A character string indicating the directory to save FImpute files. |
exec_path |
Path to the FImpute executable (default = "FImpute3"). |
name |
Name for the dataset (used internally, default = "data"). |
Value
An object of class 'FImputeRunner'.
FImputeRunner Class
Description
A class to manage FImpute execution and results.
Slots
exportAn FImputeExport object.
par_filePath to parameter file.
exec_pathPath to FImpute executable.
resultsA data.frame containing results or summary information.
SNPFileConfig Class
Description
A class for configuring SNP file import options.
Slots
pathPath to the SNP file.
fieldsA list specifying column mappings or field configurations.
codesCharacter vector for genotype or allele codes.
thresholdNumeric value for filtering or quality control.
sepCharacter specifying the field separator.
skipNumber of lines to skip at the top of the file.
SNPImportList Class
Description
A class for managing a list of SNP file import configurations.
Slots
configsA list of SNPFileConfig objects.
Subset an SNPDataLong object
Description
Subsets an SNPDataLong object by rows (individuals) or columns (SNPs).
You can specify which individuals or SNP markers to keep or remove.
Usage
Subset(object, index, margin = 1, keep = TRUE)
## S4 method for signature 'SNPDataLong'
Subset(object, index, margin = 1, keep = TRUE)
Arguments
object |
A |
index |
Character vector with row (individual) or column (SNP) names to filter. |
margin |
Integer: 1 = rows (individuals), 2 = columns (SNPs). |
keep |
Logical; if |
Value
A new SNPDataLong object, subsetted accordingly.
Convert a genotype matrix or data.frame to snpStats::SnpMatrix
Description
This function converts a genotype matrix coded as 0/1/2/NA or AA/AB/BB to a
snpStats::SnpMatrix object. It includes checks for coding validity,
missing values, and duplicate sample or SNP IDs, and preserves row and column
names from the input.
Usage
as_snpmatrix(
geno,
coding = c("012", "AAABBB"),
missing_codes = c("NA", "-9", ".", ""),
check_ids = TRUE
)
Arguments
geno |
A samples x SNPs matrix or data.frame with genotypes coded as
0, 1, 2, or NA. Can be numeric/integer or character. |
coding |
One of |
missing_codes |
Character values to treat as missing (only used when
|
check_ids |
If |
Details
The function accepts both matrix and data.frame inputs. For
data.frame objects, all columns are coerced to a common type using
as.matrix(), which preserves rownames and colnames.
The returned SnpMatrix object stores each genotype as a single byte,
which is memory-efficient compared to integer storage. However, large datasets
still require substantial RAM. For very large genotype sets, consider using
on-disk formats such as SNPRelate (GDS) or bigsnpr.
Value
A snpStats::SnpMatrix with the same dimnames as geno.
Examples
# Numeric 0/1/2 with NAs
set.seed(1)
geno <- matrix(sample(c(0L,1L,2L,NA), 20, replace=TRUE), nrow=5)
rownames(geno) <- paste0("ind", 1:5)
colnames(geno) <- paste0("snp", 1:4)
SM <- as_snpmatrix(geno)
# Character AA/AB/BB
geno_c <- matrix(sample(c("AA","AB","BB","."), 20, replace=TRUE,
prob=c(.35,.3,.3,.05)), nrow=5)
rownames(geno_c) <- rownames(geno)
colnames(geno_c) <- colnames(geno)
SMc <- as_snpmatrix(geno_c, coding="AAABBB", missing_codes=".")
Safe cbind for SnpMatrix preserving dimnames
Description
This function performs a column-wise binding of multiple SnpMatrix objects,
explicitly preserving row names and column names, avoiding unexpected "object has no names" warnings.
Usage
cbind_SnpMatrix(...)
Arguments
... |
SnpMatrix objects to combine (must have identical row names). |
Value
A single combined SnpMatrix with preserved row and column names.
Examples
m1 <- methods::new("SnpMatrix",
matrix(as.raw(1:3), nrow = 3, ncol = 2,
dimnames = list(c("S1", "S2", "S3"),
c("SNP1", "SNP2"))))
m2 <- methods::new("SnpMatrix",
matrix(as.raw(1:3), nrow = 3, ncol = 2,
dimnames = list(c("S1", "S2", "S3"),
c("SNP3", "SNP4"))))
cbind_SnpMatrix(m1, m2)
Check SNP call rate
Description
Identifies SNPs with call rates below a minimum threshold.
Usage
check.call.rate(summary, min.call.rate)
Arguments
summary |
A data frame with SNP summary statistics (must contain 'Call.rate' column). |
min.call.rate |
Numeric value specifying the minimum acceptable call rate. |
Value
Character vector with SNP names below threshold. Returns 'NULL' if none.
Examples
df <- data.frame(Call.rate = c(0.85, 0.95), row.names = c("SNP1", "SNP2"))
check.call.rate(df, 0.9)
Check Identity-By-State (IBS) for a genotype pair
Description
Checks IBS status for two genotypes.
Usage
check.ibs(gen)
Arguments
gen |
Numeric vector of length two with genotype codes. |
Value
Integer: 2 if identical non-heterozygotes, 0 if opposite homozygotes, -1 otherwise.
Examples
check.ibs(c(1, 1))
check.ibs(c(1, 3))
Check identical samples based on distance
Description
Identifies sample pairs considered identical based on genotype distances.
Usage
check.identical.samples(genotypes, threshold = 0)
Arguments
genotypes |
Genotype matrix (samples x SNPs) or SnpMatrix. |
threshold |
Numeric distance threshold. Default 0. |
Value
Data frame of identical sample pairs.
Examples
mat <- matrix(sample(0:2, 20, TRUE), nrow = 5)
rownames(mat) <- paste0("S", 1:5)
check.identical.samples(mat, 0.5)
Check identical samples by block
Description
Identifies identical samples within SNP blocks.
Usage
check.identical.samples.by.block(genotypes, blcsize, threshold = 0)
Arguments
genotypes |
Genotype matrix. |
blcsize |
Block size (number of SNPs). |
threshold |
Distance threshold. Default 0. |
Value
List of identical sample pairs.
Examples
set.seed(1)
mat <- matrix(sample(1:3, 40, TRUE), nrow = 4)
rownames(mat) <- paste0("S", 1:4)
check.identical.samples.by.block(mat, blcsize = 5, threshold = 0)
Check Mendelian inconsistencies
Description
Identifies Mendelian inconsistencies between father-child pairs.
Usage
check.mendelian.inconsistencies(genotypes, father, child)
Arguments
genotypes |
Genotype matrix. |
father |
Vector of father sample IDs. |
child |
Vector of child sample IDs. |
Value
Data frame summarizing inconsistencies per pair.
Examples
set.seed(1)
genotypes <- matrix(sample(1:3, 30, TRUE), nrow = 3,
dimnames = list(c("F1", "C1", "C2"), NULL))
check.mendelian.inconsistencies(genotypes,
father = "F1",
child = c("C1", "C2"))
Check Mendelian inconsistencies for a pair
Description
Calculates number of inconsistencies and total comparable SNPs for a parent-child pair.
Usage
check.mendelian.inconsistencies.pair(g1, g2)
Arguments
g1 |
Genotype vector for parent. |
g2 |
Genotype vector for child. |
Value
Numeric vector: [# inconsistencies, # comparable SNPs].
Examples
g1 <- c(1, 1, 3, 3, 2)
g2 <- c(3, 1, 1, 3, 2)
check.mendelian.inconsistencies.pair(g1, g2)
Check Sample Call Rate
Description
Identifies samples with call rate below a given threshold.
Usage
check.sample.call.rate(sample.summary, min.call.rate)
Arguments
sample.summary |
A data frame with a "Call.rate" column for each sample. |
min.call.rate |
Minimum acceptable call rate (between 0 and 1). |
Value
A character vector with the names of samples to remove.
Check sample heterozygosity
Description
Identifies samples with heterozygosity values deviating beyond a specified threshold.
Usage
check.sample.heterozygosity(sample.summary, max.dev)
Arguments
sample.summary |
Data frame containing sample summary (must have 'Heterozygosity' column). |
max.dev |
Maximum number of standard deviations allowed from mean. |
Value
Character vector with sample names considered outliers. Returns 'NULL' if none.
Examples
ss <- data.frame(Heterozygosity = c(0.2, 0.5, 0.7))
rownames(ss) <- c("Ind1", "Ind2", "Ind3")
check.sample.heterozygosity(ss, 1)
Check SNP by chromosome
Description
Filters SNP names belonging to specified chromosomes.
Usage
check.snp.chromo(snpmap, chromosomes)
Arguments
snpmap |
Data frame with SNP map info (must contain columns 'Chromosome' and 'Name'). |
chromosomes |
Vector of chromosome identifiers to filter. |
Value
Character vector with SNP names.
Examples
snpmap <- data.frame(Chromosome = c(1, 1, 2), Name = c("SNP1", "SNP2", "SNP3"))
check.snp.chromo(snpmap, 1)
Check SNP Hardy-Weinberg equilibrium deviation
Description
Identifies SNPs deviating from HWE beyond a z-score threshold.
Usage
check.snp.hwe(snp.summary, max.dev)
Arguments
snp.summary |
Data frame with SNP summary (must contain 'z.HWE' column). |
max.dev |
Maximum z-score allowed. |
Value
Character vector with SNP names deviating from HWE. Returns 'NULL' if none.
Examples
df <- data.frame(z.HWE = c(2, 5), row.names = c("SNP1", "SNP2"))
check.snp.hwe(df, 3)
Check SNPs for Hardy-Weinberg equilibrium deviation using chi-square p-values
Description
This function identifies SNP markers whose Hardy-Weinberg equilibrium (HWE) chi-square p-values
indicate significant deviation beyond a specified threshold. It uses the p-values computed by
get.hwe.chi2 on the input summary data frame.
Usage
check.snp.hwe.chi2(snp.summary, max.dev)
Arguments
snp.summary |
A data frame or matrix containing summary statistics for SNP markers.
The row names should correspond to SNP identifiers. It must be compatible with
the function |
max.dev |
A numeric value specifying the maximum acceptable p-value threshold. SNPs with p-values below this threshold are considered as deviating from HWE. |
Details
Any SNP with missing p-value (NA) is treated as not failing (returned as FALSE).
Value
A character vector of SNP identifiers (rownames) that fail the HWE test (p-value < max.dev).
If no SNPs fail, an empty vector is returned.
See Also
Examples
snp.summary <- data.frame(
Calls = c(100, 100),
P.AA = c(0.25, 0.7),
P.AB = c(0.50, 0.05),
P.BB = c(0.25, 0.25),
row.names = c("SNP1", "SNP2")
)
check.snp.hwe.chi2(snp.summary, max.dev = 0.05)
Check SNP minor allele frequency
Description
Identifies SNPs with minor allele frequency below a minimum threshold.
Usage
check.snp.maf(snp.summary, min.maf)
Arguments
snp.summary |
Data frame with SNP summary (must contain 'MAF' column). |
min.maf |
Minimum MAF allowed. |
Value
Character vector with SNP names below threshold. Returns 'NULL' if none.
Examples
df <- data.frame(MAF = c(0.01, 0.2), row.names = c("SNP1", "SNP2"))
check.snp.maf(df, 0.05)
Check SNP missing genotype frequencies
Description
Identifies SNPs with genotype frequencies below a minimum threshold.
Usage
check.snp.mgf(snp.summary, min.mgf)
Arguments
snp.summary |
Data frame with columns 'P.AA', 'P.AB', 'P.BB'. |
min.mgf |
Minimum genotype frequency allowed. |
Value
Character vector with SNP names below threshold. Returns 'NULL' if none.
Examples
df <- data.frame(P.AA = c(0.01, 0.5), P.AB = c(0.02, 0.4), P.BB = c(0.01, 0.1))
rownames(df) <- c("SNP1", "SNP2")
check.snp.mgf(df, 0.05)
Check SNP monomorphic status
Description
Identifies SNPs considered monomorphic.
Usage
check.snp.monomorf(snp.summary)
Arguments
snp.summary |
Data frame with columns 'P.AA', 'P.AB', 'P.BB'. |
Value
Character vector with monomorphic SNP names. Returns 'NULL' if none.
Examples
df <- data.frame(P.AA = c(1, 0.5), P.AB = c(0, 0.5), P.BB = c(0, 0))
rownames(df) <- c("SNP1", "SNP2")
check.snp.monomorf(df)
Check SNP no position
Description
Identifies SNPs with position equal to zero in the SNP map.
Usage
check.snp.no.position(snpmap)
Arguments
snpmap |
Data frame with columns 'Position' and 'Name'. |
Value
Character vector with SNP names without position. Returns 'NULL' if none.
Examples
df <- data.frame(Position = c(0, 100), Name = c("SNP1", "SNP2"))
check.snp.no.position(df)
Check SNPs mapped to the same position
Description
Identifies groups of SNPs that are mapped to the exact same genomic position on each chromosome. Returns a list where each element corresponds to one group of overlapping SNPs.
Identifies SNPs that share the same position on the same chromosome.
Usage
check.snp.same.position(snpmap)
check.snp.same.position(snpmap)
Arguments
snpmap |
Data frame with columns 'Chromosome', 'Position', and 'Name'. |
Value
A list of character vectors, each with names of SNPs found at the same position.
List of SNP groups sharing positions.
Examples
df <- data.frame(Chromosome = c(1, 1, 2),
Position = c(100, 100, 200),
Name = c("SNP1", "SNP2", "SNP3"))
check.snp.same.position(df)
Combine multiple SNPDataLong objects
Description
This function merges a list of SNPDataLong objects, typically representing different SNP panels
or datasets, into a single unified SNPDataLong object. It ensures that all genotype matrices
have the same set of SNPs (filling missing SNPs with NA), and merges the marker map information while
removing duplicate SNP entries.
Usage
combineSNPData(lista)
Arguments
lista |
A list of |
Value
A single SNPDataLong object containing the combined genotype matrix, merged map,
and a concatenated path string.
Examples
make_obj <- function(samples, snps) {
m <- methods::new("SnpMatrix",
matrix(as.raw(1:3),
nrow = length(samples),
ncol = length(snps),
dimnames = list(samples, snps)))
methods::new("SNPDataLong",
geno = m,
map = data.frame(Name = snps,
Chromosome = 1,
Position = seq_along(snps)),
path = tempfile(),
xref_path = "chip1")
}
obj1 <- make_obj(c("S1", "S2"), c("SNP1", "SNP2"))
obj2 <- make_obj(c("S3", "S4"), c("SNP2", "SNP3"))
combined <- combineSNPData(list(obj1, obj2))
Do genome relationship matrix PCA
Description
Performs PCA using the genome relationship matrix (GRM).
Usage
doPCA(genotypes)
Arguments
genotypes |
Genotype matrix. |
Value
List containing 'pcs' (principal components) and 'eigen' (eigenvalues).
Examples
set.seed(1)
mat <- matrix(sample(as.raw(1:3), 200, TRUE), nrow = 10, ncol = 20)
geno <- methods::new("SnpMatrix", mat)
rownames(geno) <- paste0("S", 1:10)
colnames(geno) <- paste0("SNP", 1:20)
res <- doPCA(geno)
str(res)
Exploratory plots for SNP and sample summary
Description
Generates exploratory plots: MAF histograms, HWE plots, heterozygosity scatter, MDS, and dendrogram.
Usage
exploratory.plots(
snp.summary,
snps.plot,
sample.summary,
samples.plot,
distm,
glabels,
mds.plot,
hierq.plot
)
Arguments
snp.summary |
Data frame with SNP summary. |
snps.plot |
Filename for SNP histogram plot. |
sample.summary |
Data frame with sample summary. |
samples.plot |
Filename for heterozygosity plot. |
distm |
Distance matrix for samples. |
glabels |
Sample labels for plots. |
mds.plot |
Filename for MDS plot. |
hierq.plot |
Filename for hierarchical cluster plot. |
Value
NULL (plots are saved as JPEG files)
Examples
tmp <- tempfile(fileext = ".jpg")
snp.summary <- data.frame(
MAF = runif(20),
z.HWE = rnorm(20),
Calls = rep(100, 20),
P.AA = runif(20, 0, 0.5),
P.AB = runif(20, 0, 0.5),
P.BB = runif(20, 0, 0.5)
)
sample.summary <- data.frame(
Call.rate = runif(5, 0.9, 1),
Heterozygosity = runif(5, 0.2, 0.4),
row.names = paste0("S", 1:5)
)
distm <- stats::dist(matrix(rnorm(25), nrow = 5))
exploratory.plots(snp.summary,
snps.plot = tempfile(fileext = ".jpg"),
sample.summary = sample.summary,
samples.plot = tempfile(fileext = ".jpg"),
distm = distm,
glabels = paste0("S", 1:5),
mds.plot = tempfile(fileext = ".jpg"),
hierq.plot = tempfile(fileext = ".jpg"))
Convert geno slot from SNPDataLong to a data.frame
Description
Converts the genotype matrix (geno slot) of a SNPDataLong object to a data.frame, with optional centering and scaling per SNP (column).
Usage
genoToDF(object, center = FALSE, scale = FALSE)
Arguments
object |
An object of class SNPDataLong. |
center |
Logical or numeric. If TRUE (default FALSE), center columns to mean zero. |
scale |
Logical or numeric. If TRUE (default FALSE), scale columns to standard deviation one. |
Value
A data.frame with individuals as rows and SNPs as columns (numeric 0/1/2, or centered/scaled values).
Examples
set.seed(1)
raw_mat <- matrix(as.raw(sample(1:3, 100, TRUE)), nrow = 10, ncol = 10)
rownames(raw_mat) <- paste0("S", 1:10)
colnames(raw_mat) <- paste0("SNP", 1:10)
geno <- methods::new("SnpMatrix", raw_mat)
obj <- methods::new("SNPDataLong",
geno = geno,
map = data.frame(Name = colnames(geno),
Chromosome = 1,
Position = 1:10),
path = tempfile(),
xref_path = "chip1")
df <- genoToDF(obj, center = TRUE, scale = TRUE)
head(df[, 1:5])
Get correlation (fc method)
Description
Calculates genotype correlation using a fast check (fc) method.
Usage
get.correl.fc(g1, g2)
Arguments
g1 |
Genotype vector. |
g2 |
Genotype vector. |
Value
Numeric value of correlation.
Examples
g1 <- sample(0:2, 10, TRUE)
g2 <- sample(0:2, 10, TRUE)
get.correl.fc(g1, g2)
Get gender based on heterozygosity
Description
Infers gender using heterozygosity thresholds.
Usage
get.gender(sample.summary, threshM, threshF)
Arguments
sample.summary |
Data frame with 'Heterozygosity' column. |
threshM |
Numeric threshold for males. |
threshF |
Numeric threshold for females. |
Value
Data frame with columns 'heterozygosity' and 'sex'.
Examples
df <- data.frame(Heterozygosity = c(0.1, 0.3, 0.6))
rownames(df) <- c("A", "B", "C")
get.gender(df, 0.2, 0.5)
Get HWE chi-square p-values
Description
Calculates Hardy-Weinberg equilibrium chi-square p-values for SNPs.
Usage
get.hwe.chi2(snp.summary)
Arguments
snp.summary |
Data frame with columns 'Calls', 'P.AA', 'P.AB', 'P.BB'. |
Value
Numeric vector with p-values.
Examples
df <- data.frame(Calls = c(100, 100), P.AA = c(0.6, 0.4), P.AB = c(0.3, 0.4), P.BB = c(0.1, 0.2))
get.hwe.chi2(df)
Flexible and efficient genotype file reading with autodetection using fread
Description
Allows flexible import of SNP genotype data from Illumina FinalReport files,
using fast initial column detection via data.table::fread, followed by
full genotype matrix construction with snpStats::read.snps.long.
Usage
getGeno(...)
## S4 method for signature 'ANY'
getGeno(
path,
fields = list(sample = 2, snp = 1, allele1 = 7, allele2 = 8, confidence = 9),
codes = c("A", "B"),
threshold = 0.15,
sep = "\t",
skip = 0,
verbose = TRUE,
every = NULL
)
Arguments
... |
Additional optional arguments. |
path |
Path to the directory containing |
fields |
List specifying column indices (sample, snp, allele1, allele2, confidence) |
codes |
Allele codes (e.g., |
threshold |
Confidence threshold |
sep |
Field separator |
skip |
Lines to skip |
verbose |
Logical; show progress |
every |
Frequency for progress updates |
Value
An SNPDataLong object
IBS pair statistics
Description
Calculates IBS mean and standard deviation between two samples.
Usage
ibs.pair(g1, g2)
Arguments
g1 |
Genotype vector for first sample. |
g2 |
Genotype vector for second sample. |
Value
Numeric vector: [mean IBS, standard deviation].
Examples
g1 <- sample(0:2, 10, TRUE)
g2 <- sample(0:2, 10, TRUE)
ibs.pair(g1, g2)
Import and combine multiple genotype configurations
Description
Imports genotype data from multiple configurations defined in an
SNPImportList object and combines them into a unified SNPDataLong object.
Usage
importAllGenos(object)
## S4 method for signature 'SNPImportList'
importAllGenos(object)
Arguments
object |
An |
Value
A combined SNPDataLong object.
Import imputed FImpute results from disk
Description
Reads existing imputed results from a given path and returns an object of class SNPDataLong.
Usage
importFImputeResults(path, method = "R")
Arguments
path |
Character. Path to the folder containing 'output_fimpute' (e.g., "fimpute_run_nelore"). |
method |
Character. "R" (default) or "Rcpp". Passed to read.fimpute(). |
Value
An object of class SNPDataLong containing the imputed genotypes and SNP map.
Import multiple genotype datasets from a list of configurations
Description
Reads and imports multiple genotype datasets specified in a list of configurations. Each configuration must include the path to the genotype data and information on field mapping. Optionally, you can also specify codes, quality threshold, separator, lines to skip, and a subset of IDs to retain. The function automatically fills the 'xref_path' slot per individual and combines maps into a single data.frame, adding a 'SourcePath' column indicating their origin and removing duplicated SNP rows (by Name). Prints progress messages indicating the current path being loaded (with counter).
Usage
import_geno_list(config_list)
Arguments
config_list |
A list of configuration lists. Each element should contain: - 'path' (character): Path to the genotype file or folder. - 'fields' (list): Named list defining the columns (e.g., SNP ID, sample ID, alleles, confidence). - 'codes' (character vector, optional): Allele codes (default is c("A", "B")). - 'threshold' (numeric, optional): Maximum allowed missingness or confidence threshold (default 0.15). - 'sep' (character, optional): Field separator in the input file (default "tab-delimited"). - 'skip' (integer, optional): Number of lines to skip at the beginning of the file (default 0). - 'verbose' (logical, optional): Whether to print detailed messages (default TRUE). - 'subset' (character vector, optional): Vector of sample IDs to retain after import. |
Value
An object of class 'SNPDataLong' containing: - Combined genotype matrix ('geno'). - Combined map ('map') as a single data.frame with 'SourcePath' column and without duplicated rows. - Combined 'xref_path' vector (one entry per individual). - 'path' slot as a semicolon-separated string of all input dataset paths.
Convert pairs to sets
Description
Groups sample pairs into sets of related samples.
Usage
pairs2sets(pairs)
Arguments
pairs |
Matrix or list of sample pairs. |
Value
List of sets of samples.
Examples
pairs <- matrix(c("A", "B", "B", "C", "D", "E"), ncol = 2, byrow = TRUE)
pairs2sets(pairs)
Plot PCA groups from anticlustering result
Description
Plot PCA groups from anticlustering result
Usage
plotPCAgroups(pca_res, groups, pcs = c(1, 2), filename = NULL)
Arguments
pca_res |
A prcomp object. |
groups |
A factor or vector of group assignments. |
pcs |
Vector of length 2 indicating which PCs to plot (default: c(1, 2)). |
filename |
Optional. If provided, saves plot to this file (e.g., "antic.png"). |
Value
A ggplot object (also prints to screen).
Examples
set.seed(1)
pca_res <- stats::prcomp(matrix(rnorm(200), nrow = 20))
groups <- sample(1:2, 20, replace = TRUE)
plotPCAgroups(pca_res, groups)
Print method for SNPDataLong summary
Description
Displays the contents of a summary.SNPDataLong object on the console.
Usage
## S3 method for class 'summary.SNPDataLong'
print(x, ...)
Arguments
x |
An object of class |
... |
Further arguments (currently unused). |
Value
The input x, returned invisibly.
Quality Control for SNPDataLong with optional criteria
Description
Applies flexible quality control filters on an object of class SNPDataLong.
Supports call rate filtering, minor allele frequency (MAF), Hardy-Weinberg equilibrium (HWE),
removal of monomorphic SNPs, exclusion of specific chromosomes, optionally removing SNPs without positions,
and optionally removing SNPs at the same genomic position (keeping the one with highest MAF).
Usage
qcSNPs(x, ...)
## S4 method for signature 'SNPDataLong'
qcSNPs(
x,
missing_ind = NULL,
missing_snp = NULL,
min_snp_cr = NULL,
min_maf = NULL,
hwe = NULL,
snp_position = NULL,
no_position = NULL,
snp_mono = FALSE,
remove_chr = NULL,
action = c("report", "filter", "both")
)
Arguments
x |
An object of class SNPDataLong. |
... |
Additional optional arguments. |
missing_ind |
Maximum allowed proportion of missing data per individual (currently not implemented). |
missing_snp |
Maximum allowed proportion of missing data per SNP (currently not implemented). |
min_snp_cr |
Minimum acceptable call rate for SNPs (e.g., 0.95). SNPs below this threshold are removed. |
min_maf |
Minimum minor allele frequency allowed for SNPs (e.g., 0.05). SNPs with lower MAF are removed. |
hwe |
p-value threshold for Hardy-Weinberg equilibrium test (e.g., 1e-6). SNPs violating this are removed. |
snp_position |
Logical. If TRUE, removes SNPs mapped to the same position, retaining only the one with highest MAF. |
no_position |
Logical. If TRUE, removes SNPs without defined genomic positions. |
snp_mono |
Logical. If TRUE, removes monomorphic SNPs (with no variation). |
remove_chr |
Character vector of chromosomes to exclude (e.g., c("X", "Y")). |
action |
One of "report" (returns a list of removed SNPs), "filter" (returns filtered SNPDataLong), or "both" (returns both). |
Value
Depending on the action argument: - "report": list of SNPs removed by each filter and SNPs retained. - "filter": filtered SNPDataLong object. - "both": list containing the filtered object and detailed report.
Examples
set.seed(123)
raw_mat <- matrix(as.raw(sample(1:3, 100, TRUE)), nrow = 10, ncol = 10)
colnames(raw_mat) <- paste0("snp", 1:10)
rownames(raw_mat) <- paste0("ind", 1:10)
geno <- methods::new("SnpMatrix", raw_mat)
map <- data.frame(Name = colnames(geno), Chromosome = 1, Position = 1:10)
x <- methods::new("SNPDataLong",
geno = geno,
map = map,
path = tempfile(),
xref_path = "chip1")
qcSNPs(x,
min_snp_cr = 0.8,
min_maf = 0.05,
snp_mono = TRUE,
no_position = TRUE,
snp_position = TRUE,
action = "filter")
Quality control on samples
Description
Applies quality control (QC) procedures to samples in a 'SNPDataLong' object, based on heterozygosity and call rate thresholds.
Usage
qcSamples(x, ...)
## S4 method for signature 'SNPDataLong'
qcSamples(
x,
heterozygosity = NULL,
smp_cr = NULL,
action = c("report", "filter", "both")
)
Arguments
x |
An object of class 'SNPDataLong'. |
... |
Additional optional arguments. |
heterozygosity |
A numeric threshold or range for heterozygosity. Samples outside this threshold are removed. |
smp_cr |
Minimum acceptable sample call rate (between 0 and 1). Samples below this value are removed. |
action |
Character string indicating the action to perform. One of: - '"report"': only returns a list of samples to remove and those kept; - '"filter"': returns a filtered object without reporting; - '"both"': performs filtering and returns the filtered object. |
Value
Depending on the 'action' argument: - '"report"': returns a list with removed and kept samples; - '"filter"': returns a new 'SNPDataLong' object with filtered genotypes; - '"both"': returns a list with: - 'filtered': the filtered 'SNPDataLong' object; - 'report': a list of removed and kept samples.
Formatted header message
Description
Prints a formatted message with a border for section titles in the console.
Usage
qc_header(title)
Arguments
title |
Character string to be printed inside the header box. |
Value
No return value. Used for side effects (message).
Examples
qc_header("Quality Control on Samples")
Faster row-bind for SnpMatrix objects with differing columns
Description
Combines multiple SnpMatrix objects by rows, automatically handling differing SNP columns, optimized for large matrices.
Usage
rbindSnpFlexible(...)
Arguments
... |
One or more SnpMatrix objects. |
Value
A single SnpMatrix object with all rows combined.
Examples
m1 <- methods::new("SnpMatrix",
matrix(as.raw(1:3), nrow = 2, ncol = 3,
dimnames = list(c("S1", "S2"),
c("SNP1", "SNP2", "SNP3"))))
m2 <- methods::new("SnpMatrix",
matrix(as.raw(1:3), nrow = 2, ncol = 2,
dimnames = list(c("S3", "S4"),
c("SNP2", "SNP4"))))
rbindSnpFlexible(m1, m2)
Safe rbind for SnpMatrix preserving dimnames
Description
This function performs a row-wise binding of multiple SnpMatrix objects,
explicitly preserving row names and column names, avoiding unexpected "object has no names" warnings.
Usage
rbind_SnpMatrix(...)
Arguments
... |
SnpMatrix objects to combine (must have identical column names). |
Value
A single combined SnpMatrix with preserved row and column names.
Examples
m1 <- methods::new("SnpMatrix",
matrix(as.raw(1:3), nrow = 2, ncol = 3,
dimnames = list(c("S1", "S2"),
c("SNP1", "SNP2", "SNP3"))))
m2 <- methods::new("SnpMatrix",
matrix(as.raw(1:3), nrow = 2, ncol = 3,
dimnames = list(c("S3", "S4"),
c("SNP1", "SNP2", "SNP3"))))
rbind_SnpMatrix(m1, m2)
Read imputed genotypes from FImpute output and return SNPDataLong object
Description
Reads imputed genotypes and SNP information from FImpute output, builds a SnpMatrix and a corresponding map, and returns an SNPDataLong object.
Usage
read.fimpute(file, method = c("R", "Rcpp"))
Arguments
file |
Character. Path to the FImpute output directory (usually "output_fimpute"). |
method |
Character. "R" (default) for vectorized R implementation, or "Rcpp" for compiled C++ implementation. |
Value
An object of class SNPDataLong with three slots:
geno (a SnpMatrix with individuals as rows and SNPs as
columns), map (a data.frame with columns Name,
Chromosome, and Position), and path (the input
directory).
Examples
## Not run:
# Requires a directory containing FImpute output files
# (genotypes_imp.txt and snp_info.txt).
snp_long <- read.fimpute("output_fimpute", method = "R")
## End(Not run)
Run PCA and anticlustering on SNPDataLong
Description
Converts a SNPDataLong object to a data.frame, runs PCA, and performs anticlustering on the selected principal components.
Usage
runAnticlusteringPCA(object, K = 2, n_pcs = 20, center = TRUE, scale = TRUE)
Arguments
object |
An object of class |
K |
Number of groups for anticlustering, or a vector of group sizes (as in anticlust). |
n_pcs |
Number of top principal components to use. If |
center |
Logical or numeric. Passed to |
scale |
Logical or numeric. Passed to |
Value
A list with components:
- groups
Integer vector with anticlustering group assignments.
- pca
The PCA result object (from
stats::prcomp).- pcs
Numeric matrix of the PCs used for anticlustering.
Examples
res <- runAnticlusteringPCA(nelore_imputed, K = 2, n_pcs = 0.8)
table(res$groups)
Run FImpute from a FImputeRunner object
Description
This function runs the external FImpute software using a 'FImputeRunner' object, ensuring that all required input files are present and the results are imported.
Usage
runFImpute(object, verbose = TRUE)
## S4 method for signature 'FImputeRunner'
runFImpute(object, verbose = TRUE)
Arguments
object |
An object of class 'FImputeRunner'. |
verbose |
Logical. If TRUE (default), FImpute output will be printed to the console. |
Value
An updated 'FImputeRunner' object with the 'results' slot populated (SNPDataLong).
Examples
## Not run:
# Requires the external FImpute3 binary in PATH.
path_fimpute <- file.path(tempdir(), "fimpute_run_example")
param_file <- file.path(path_fimpute, "fimpute.par")
export_obj <- methods::new("FImputeExport",
geno = geno_obj@geno,
map = geno_obj@map,
path = path_fimpute)
runner <- methods::new("FImputeRunner",
export = export_obj,
par_file = param_file,
exec_path = "FImpute3")
res <- runFImpute(runner, verbose = TRUE)
## End(Not run)
Run ADMIXTURE analysis
Description
This function runs the ADMIXTURE program on a set of PLINK files (.bed/.bim/.fam) located in a specified directory, using a given file prefix. It supports both unsupervised and supervised analyses, optional cross-validation, and custom output file prefixes to avoid overwriting results.
Usage
run_admixture(
path,
prefix,
admixture_path = "admixture",
K,
supervised = FALSE,
pop_assignments = NULL,
extra_args = NULL,
out_prefix = NULL,
cv = NULL
)
Arguments
path |
Character. Path to the folder containing PLINK files. |
prefix |
Character. File prefix (without extension). The function will look for '<prefix>.bed', '<prefix>.bim', and '<prefix>.fam' in 'path'. |
admixture_path |
Character. Path to the ADMIXTURE executable, or "admixture" if in system PATH. Default is "admixture". |
K |
Integer. Number of ancestral populations to estimate. |
supervised |
Logical. If TRUE, runs ADMIXTURE in supervised mode (requires |
pop_assignments |
Character vector. Population assignments for each individual (length equal to number of individuals in '.fam'). Use |
extra_args |
Character vector. Additional arguments to pass to ADMIXTURE (e.g., other flags). Default is NULL. |
out_prefix |
Character. Optional prefix for renaming output files (.Q, .P, .log) after the run completes. Default is NULL. |
cv |
Integer. Number of folds for cross-validation (e.g., 5 or 10). If provided, adds |
Details
When supervised = TRUE, a '.pop' file is automatically created in the specified directory.
Each line in this file corresponds to one individual, containing the population name or "-" for missing assignments.
If out_prefix is provided, the function renames the standard ADMIXTURE output files
(e.g., '<prefix>.3.Q') to use this prefix (e.g., 'myrun.Q').
The function only works on Linux or macOS systems.
Value
No value returned. Runs ADMIXTURE as a side effect. Generates output files in the specified directory. Messages indicate progress and output file names.
Examples
## Not run:
# Requires the external ADMIXTURE binary and PLINK files prepared beforehand.
work_dir <- file.path(tempdir(), "admixture_demo")
run_admixture(
path = work_dir,
prefix = "plink_data",
admixture_path = "admixture",
K = 3,
out_prefix = "run1_k3"
)
pop_vec <- c("A", "A", "B", "B", "-", "-", "A", "B", "A", "-")
run_admixture(
path = work_dir,
prefix = "plink_data",
admixture_path = "admixture",
K = 3,
supervised = TRUE,
pop_assignments = pop_vec,
cv = 10,
out_prefix = "supervised_k3_cv10"
)
## End(Not run)
Save genotype and map files in FImpute format
Description
S4 method to export genotype (.gen), map (.map), and parameter (fimpute.par) files compatible with [FImpute](https://www.aps.uoguelph.ca/~msargol/fimpute/).
Usage
saveFImpute(object, ...)
## S4 method for signature 'FImputeExport'
saveFImpute(object)
## S4 method for signature 'SNPDataLong'
saveFImpute(object, path)
Arguments
object |
An object of class 'FImputeExport' or 'SNPDataLong'. |
... |
Additional arguments passed to methods. |
path |
Output directory. Must be supplied by the caller (e.g. a path
inside |
Value
No return value, called for side effects. The function writes the
files data.gen, data.map, and fimpute.par to the
directory path and returns NULL invisibly.
Export genotypes and map using basic arguments
Description
Convenience function to export FImpute files directly from a 'SnpMatrix' and map 'data.frame'.
Usage
saveFImputeRaw(geno, map, path, xref = NULL)
Arguments
geno |
A 'SnpMatrix' object. |
map |
A data.frame with columns 'Name', 'Chromosome', 'Position', and 'SourcePath'. |
path |
Output directory. |
xref |
Optional vector of identifiers per individual (used to assign numeric chip IDs). |
Value
No return value, called for side effects. The function writes
three files (data.gen, data.map, and fimpute.par) to
the directory specified by path and returns NULL invisibly.
Save SNPDataLong object to PLINK format
Description
Saves genotype and map data from an SNPDataLong object in PLINK format (.ped/.map and optionally binary files).
Usage
savePlink(
object,
path,
name = "plink_data",
run_plink = TRUE,
chunk_size = 1000
)
Arguments
object |
An object of class SNPDataLong. |
path |
Character. Directory where files will be saved. Must be supplied
by the caller (e.g. a folder inside |
name |
Character. Base name for PLINK output files. |
run_plink |
Logical. If TRUE (default), runs PLINK1 to convert to binary files. If FALSE, only .ped and .map files are saved. |
chunk_size |
Integer. Number of individuals per chunk for writing .ped file (default: 1000). |
Value
No return value, called for side effects. Files (.ped/.map,
and .bed/.bim/.fam when run_plink = TRUE) are
written under path.
Examples
set.seed(1)
raw_mat <- matrix(as.raw(sample(1:3, 100, TRUE)), nrow = 10, ncol = 10)
rownames(raw_mat) <- paste0("S", 1:10)
colnames(raw_mat) <- paste0("SNP", 1:10)
geno <- methods::new("SnpMatrix", raw_mat)
obj <- methods::new("SNPDataLong",
geno = geno,
map = data.frame(Name = colnames(geno),
Chromosome = 1,
Position = 1:10),
path = tempfile(),
xref_path = "chip1")
savePlink(obj, path = tempdir(), name = "demo",
run_plink = FALSE, chunk_size = 5)
Summary for SNPDataLong objects
Description
Provides a detailed summary of an SNPDataLong object, including sample
and SNP counts, proportion of missing data, and SNP distribution by chromosome
if mapping information is available.
Usage
## S4 method for signature 'SNPDataLong'
summary(object, ...)
Arguments
object |
An object of class |
... |
Further arguments passed to methods. |
Value
An object of class summary.SNPDataLong, which is a list with
the following elements:
- n_individuals
Integer. Number of individuals (rows of
geno).- n_snps
Integer. Number of SNPs (columns of
geno).- n_missing
Integer. Total number of missing genotype calls.
- prop_missing
Numeric. Proportion of missing genotype calls.
- by_chromosome
Either a table of SNP counts per chromosome (when the map provides
NameandChromosome) orNULL.- missing_by_chromosome
Either a table of SNPs with at least one missing call per chromosome, or
NULL.
The object also has a dedicated print method that displays the
summary on the console.