Type: | Package |
Title: | Genotype Quality Control with 'PLINK' |
Version: | 0.3.4 |
URL: | https://meyer-lab-cshl.github.io/plinkQC/ |
BugReports: | https://github.com/meyer-lab-cshl/plinkQC/issues |
Maintainer: | Hannah Meyer <hannah.v.meyer@gmail.com> |
Description: | Genotyping arrays enable the direct measurement of an individuals genotype at thousands of markers. 'plinkQC' facilitates genotype quality control for genetic association studies as described by Anderson and colleagues (2010) <doi:10.1038/nprot.2010.116>. It makes 'PLINK' basic statistics (e.g. missing genotyping rates per individual, allele frequencies per genetic marker) and relationship functions accessible from 'R' and generates a per-individual and per-marker quality control report. Individuals and markers that fail the quality control can subsequently be removed to generate a new, clean dataset. Removal of individuals based on relationship status is optimised to retain as many individuals as possible in the study. |
Depends: | R (≥ 3.6.0) |
Imports: | methods, optparse, data.table (≥ 1.11.0), R.utils, ggplot2, ggforce, ggrepel, cowplot, UpSetR, dplyr, igraph (≥ 1.2.4), sys |
Suggests: | testthat, mockery, formatR, knitr, rmarkdown |
License: | MIT + file LICENSE |
SystemRequirements: | plink (1.9) |
Encoding: | UTF-8 |
RoxygenNote: | 7.1.1 |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2021-07-15 15:22:27 UTC; hannah |
Author: | Hannah Meyer |
Repository: | CRAN |
Date/Publication: | 2021-07-15 15:40:02 UTC |
Check and construct PLINK sample and marker filters
Description
checkFiltering checks that the file names with the individuals and markers to be filtered can be found. If so, it constructs the command for filtering
Usage
checkFiltering(
keep_individuals = NULL,
remove_individuals = NULL,
extract_markers = NULL,
exclude_markers = NULL
)
Arguments
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
Value
Vector containing args in sys::exec_wait format to enable filtering on individuals and/or markers.
Check PLINK software access
Description
checkPlink checks that the PLINK software (https://www.cog-genomics.org/plink/1.9/) can be found from system call.
Usage
checkPlink(path2plink = NULL)
Arguments
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
Value
Path to PLINK executable.
Check and construct individual IDs to be removed
Description
checkRemoveIDs checks that the file names with the individuals to be filtered can be found. It reads the corresponding files, combines the selected individuals into one data.frame and compares these to all individuals in the analysis.
Usage
checkRemoveIDs(prefix, remove_individuals = NULL, keep_individuals)
Arguments
prefix |
[character] Prefix of PLINK files, i.e. path/2/name.bed, path/2/name.bim and path/2/name.fam. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
Value
data.frame containing family (FID) and individual (IID) IDs of individuals to be removed from analysis.
Identification of individuals of divergent ancestry
Description
Runs and evaluates results of plink –pca on merged genotypes from
individuals to be QCed and individuals of reference population of known
genotypes. Currently, check ancestry only supports automatic selection of
individuals of European descent. It uses information from principal
components 1 and 2 returned by plink –pca to find the center of the European
reference samples (mean(PC1_europeanRef), mean(PC2_europeanRef). It then
computes the maximum Euclidean distance (maxDist) of the European reference
samples from this centre. All study samples whose Euclidean distance from the
centre falls outside the circle described by the radius r=europeanTh* maxDist
are considered non-European and their IDs are returned as failing the
ancestry check.
check_ancestry
creates a scatter plot of PC1 versus PC2 colour-coded
for samples of the reference populations and the study population.
Usage
check_ancestry(
indir,
name,
qcdir = indir,
prefixMergedDataset,
europeanTh = 1.5,
defaultRefSamples = c("HapMap", "1000Genomes"),
refPopulation = c("CEU", "TSI"),
refSamples = NULL,
refColors = NULL,
refSamplesFile = NULL,
refColorsFile = NULL,
refSamplesIID = "IID",
refSamplesPop = "Pop",
refColorsColor = "Color",
refColorsPop = "Pop",
studyColor = "#2c7bb6",
legend_labels_per_row = 6,
run.check_ancestry = TRUE,
interactive = FALSE,
verbose = verbose,
highlight_samples = NULL,
highlight_type = c("text", "label", "color", "shape"),
highlight_text_size = 3,
highlight_color = "#c51b8a",
highlight_shape = 17,
highlight_legend = FALSE,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
path2plink = NULL,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] prefix of plink files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where prefixMergedDataset.eigenvec results as returned by plink –pca should be saved. Per default qcdir=indir. If run.check_ancestry is FALSE, it is assumed that plink –pca prefixMergedDataset has been run and qcdir/prefixMergedDataset.eigenvec exists.User needs writing permission to qcdir. |
prefixMergedDataset |
[character] Prefix of merged dataset (study and reference samples) used in plink –pca, resulting in prefixMergedDataset.eigenvec. |
europeanTh |
[double] Scaling factor of radius to be drawn around center of European reference samples, with study samples inside this radius considered to be of European descent and samples outside this radius of non-European descent. The radius is computed as the maximum Euclidean distance of European reference samples to the centre of European reference samples. |
defaultRefSamples |
[character] Option to use pre-downloaded individual and population identifiers from either the 1000Genomes or HapMap project. If refSamples and refSamplesFile are not provided, the HapMap identifiers (or 1000Genomes is specified) will be used as default and the function will fail if the reference samples in the prefixMergedDataset do not match these reference samples. If refColors and refColorsFile are not provided, this also sets default colors for the reference populations. |
refPopulation |
[vector] Vector with population identifiers of European reference population. Identifiers have to be corresponding to population IDs [refColorsPop] in refColorsfile/refColors. |
refSamples |
[data.frame] Dataframe with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. Either refSamples or refSamplesFile have to be specified. |
refColors |
[data.frame, optional] Dataframe with population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If not provided and is.null(refColorsFile) default colors are used. |
refSamplesFile |
[character] /path/to/File/with/reference samples. Needs columns with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. |
refColorsFile |
[character, optional] /path/to/File/with/Population/Colors containing population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor].If not provided and is.null(refColors) default colors for are used. |
refSamplesIID |
[character] Column name of reference sample IDs in refSamples/refSamplesFile. |
refSamplesPop |
[character] Column name of reference sample population IDs in refSamples/refSamplesFile. |
refColorsColor |
[character] Column name of population colors in refColors/refColorsFile |
refColorsPop |
[character] Column name of reference sample population IDs in refColors/refColorsFile. |
studyColor |
[character] Colour to be used for study population. |
legend_labels_per_row |
[integer] Number of population names per row in PCA plot. |
run.check_ancestry |
[logical] Should plink –pca be run to
determine principal components of merged dataset; if FALSE, it is assumed
that plink –pca has been run successfully and
qcdir/prefixMergedDataset.eigenvec is present;
|
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_ancestry) via ggplot2::ggsave(p=p_ancestry, other_arguments) or pdf(outfile) print(p_ancestry) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_ancestry); all highlight_samples IIDs have to be present in the IIDs of the prefixMergedDataset.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified.Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Value
Named [list] with i) fail_ancestry, containing a [data.frame] with FID and IID of non-European individuals and ii) p_ancestry, a ggplot2-object 'containing' a scatter plot of PC1 versus PC2 colour-coded for samples of the reference populations and the study population, which can be shown by print(p_ancestry).
Examples
## Not run:
indir <- system.file("extdata", package="plinkQC")
name <- "data"
fail_ancestry <- check_ancestry(indir=indir, name=name,
refSamplesFile=paste(indir, "/HapMap_ID2Pop.txt",sep=""),
refColorsFile=paste(indir, "/HapMap_PopColors.txt", sep=""),
prefixMergedDataset="data.HapMapIII", interactive=FALSE,
run.check_ancestry=FALSE)
# highlight samples
highlight_samples <- read.table(system.file("extdata", "keep_individuals",
package="plinkQC"))
fail_ancestry <- check_ancestry(indir=qcdir, name=name,
refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""),
refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""),
prefixMergedDataset="data.HapMapIII", interactive=FALSE,
highlight_samples = highlight_samples[,2],
run.check_ancestry=FALSE,
highlight_type = c("text", "shape"))
## End(Not run)
Identification of individuals with outlying missing genotype or heterozygosity rates
Description
Runs and evaluates results from plink –missing (missing genotype rates
per individual) and plink –het (heterozygosity rates per individual).
Non-systematic failures in genotyping and outlying heterozygosity (hz) rates
per individual are often proxies for DNA sample quality. Larger than expected
heterozygosity can indicate possible DNA contamination.
The mean heterozygosity in PLINK is computed as hz_mean = (N-O)/N, where
N: number of non-missing genotypes and O:observed number of homozygous
genotypes for a given individual.
Mean heterozygosity can differ between populations and SNP genotyping panels.
Within a population and genotyping panel, a reduced heterozygosity rate can
indicate inbreeding - these individuals will then likely be returned by
check_relatedness
as individuals that fail the relatedness
filters. check_het_and_miss
creates a scatter plot with the
individuals' missingness rates on x-axis and their heterozygosity rates on
the y-axis.
Usage
check_het_and_miss(
indir,
name,
qcdir = indir,
imissTh = 0.03,
hetTh = 3,
run.check_het_and_miss = TRUE,
label_fail = TRUE,
highlight_samples = NULL,
highlight_type = c("text", "label", "color", "shape"),
highlight_text_size = 3,
highlight_color = "#c51b8a",
highlight_shape = 17,
highlight_legend = FALSE,
interactive = FALSE,
verbose = FALSE,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
path2plink = NULL,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam, name.het and name.imiss. |
qcdir |
[character] /path/to/directory where name.het as returned by plink –het and name.imiss as returned by plink –missing will be saved. Per default qcdir=indir. If run.check_het_and_miss is FALSE, it is assumed that plink –missing and plink –het have been run and qcdir/name.imiss and qcdir/name.het are present. User needs writing permission to qcdir. |
imissTh |
[double] Threshold for acceptable missing genotype rate per individual; has to be proportion between (0,1) |
hetTh |
[double] Threshold for acceptable deviation from mean heterozygosity per individual. Expressed as multiples of standard deviation of heterozygosity (het), i.e. individuals outside mean(het) +/- hetTh*sd(het) will be returned as failing heterozygosity check; has to be larger than 0. |
run.check_het_and_miss |
[logical] Should plink –missing and plink
–het be run to determine genotype missingness and heterozygosity rates; if
FALSE, it is assumed that plink –missing and plink –het have been run and
qcdir/name.imiss and qcdir/name.het are present;
|
label_fail |
[logical] Set TRUE, to add fail IDs as text labels in scatter plot. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_het_imiss); all highlight_samples IIDs have to be present in the IIDs of the name.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified.Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_het_imiss) via ggplot2::ggsave(p=p_het_imiss , other_arguments) or pdf(outfile) print(p_het_imiss) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Details
check_het_and_miss
wraps around
run_check_missingness
,
run_check_heterozygosity
and
evaluate_check_het_and_miss
.
If run.check_het_and_miss is TRUE, run_check_heterozygosity
and
run_check_missingness
are executed ; otherwise it is assumed
that plink –missing and plink –het have been run externally and
qcdir/name.het and qcdir/name.imiss exist. check_het_and_miss
will fail with missing file error otherwise.
For details on the output data.frame fail_imiss and fail_het, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#imiss and https://www.cog-genomics.org/plink/1.9/formats#het
Value
Named [list] with i) fail_imiss [data.frame] containing FID (Family ID), IID (Within-family ID), MISS_PHENO (Phenotype missing? (Y/N)), N_MISS (Number of missing genotype call(s), not including obligatory missings), N_GENO (Number of potentially valid call(s)), F_MISS (Missing call rate) of individuals failing missing genotype check and ii) fail_het [data.frame] containing FID (Family ID), IID (Within-family ID), O(HOM) (Observed number of homozygotes), E(HOM) (Expected number of homozygotes), N(NM) (Number of non-missing autosomal genotypes), F (Method-of-moments F coefficient estimate) of individuals failing outlying heterozygosity check and iii) p_het_imiss, a ggplot2-object 'containing' a scatter plot with the samples' missingness rates on x-axis and their heterozygosity rates on the y-axis, which can be shown by print(p_het_imiss).
Examples
## Not run:
indir <- system.file("extdata", package="plinkQC")
name <- "data"
path2plink <- "path/to/plink"
# whole dataset
fail_het_miss <- check_het_and_miss(indir=indir, name=name,
interactive=FALSE,path2plink=path2plink)
# subset of dataset with sample highlighting
highlight_samples <- read.table(system.file("extdata", "keep_individuals",
package="plinkQC"))
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")
fail_het_miss <- check_het_and_miss(indir=indir, name=name,
interactive=FALSE,path2plink=path2plink,
remove_individuals=remove_individuals_file,
highlight_samples=highlight_samples[,2], highlight_type = c("text", "shape"))
## End(Not run)
Identification of SNPs showing a significant deviation from Hardy-Weinberg- equilibrium (HWE)
Description
Runs and evaluates results from plink –hardy. It calculates the observed and
expected heterozygote frequencies for all variants in the individuals that
passed the perIndividualQC
and computes the deviation of the
frequencies from Hardy-Weinberg equilibrium (HWE) by HWE exact test. The
p-values of the HWE exact test are displayed as histograms (stratified by
all and low p-values), where the hweTh is used to depict the quality control
cut-off for SNPs.
Usage
check_hwe(
indir,
name,
qcdir = indir,
hweTh = 1e-05,
interactive = FALSE,
path2plink = NULL,
verbose = FALSE,
showPlinkOutput = TRUE,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where results will be written to.
If |
hweTh |
[double] Significance threshold for deviation from HWE. |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_hwe) via ggplot2::ggsave(p=p_hwe, other_arguments) or pdf(outfile) print(p_hwe) dev.off(). |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
verbose |
[logical] If TRUE, progress info is printed to standard out and specifically, if TRUE, plink log will be displayed. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
Details
check_hwe
uses plink –remove name.fail.IDs –hardy to
calculate the observed and expected heterozygote frequencies per SNP in the
individuals that passed the perIndividualQC
. It does so
without generating a new dataset but simply removes the IDs when calculating
the statistics.
For details on the output data.frame fail_hwe, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#hwe.
Value
Named list with i) fail_hwe containing a [data.frame] with CHR (Chromosome code), SNP (Variant identifier), TEST (Type of test: one of 'ALL', 'AFF', 'UNAFF', 'ALL(QT)', 'ALL(NP)'), A1 (Allele 1; usually minor), A2 (Allele 2; usually major), GENO ('/'-separated genotype counts: A1 hom, het, A2 hom), O(HET) (Observed heterozygote frequency E(HET) (Expected heterozygote frequency), P (Hardy-Weinberg equilibrium exact test p-value) for all SNPs that failed the hweTh and ii) p_hwe, a ggplot2-object 'containing' the HWE p-value distribution histogram which can be shown by (print(p_hwe)).
Examples
indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data"
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
## Not run:
# run on all individuals and markers
fail_hwe <- check_hwe(indir=indir, qcdir=qcdir, name=name, interactive=FALSE,
verbose=TRUE, path2plink=path2plink)
# run on subset of individuals and markers
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")
extract_markers_file <- system.file("extdata", "extract_markers",
package="plinkQC")
fail_hwe <- check_hwe(qcdir=qcdir, indir=indir,
name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink,
remove_individuals=remove_individuals_file,
extract_markers=extract_markers_file)
## End(Not run)
Identification of SNPs with low minor allele frequency
Description
Runs and evaluates results from plink –freq. It calculates the minor allele
frequencies for all variants in the individuals that passed the
perIndividualQC
. The minor allele frequency distributions is
plotted as a histogram.
Usage
check_maf(
indir,
name,
qcdir = indir,
macTh = 20,
mafTh = NULL,
verbose = FALSE,
interactive = FALSE,
path2plink = NULL,
showPlinkOutput = TRUE,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where results will be written to.
If |
macTh |
[double] Threshold for minor allele cut cut-off, if both mafTh and macTh are specified, macTh is used (macTh = mafTh\*2\*NrSamples). |
mafTh |
[double] Threshold for minor allele frequency cut-off. |
verbose |
[logical] If TRUE, progress info is printed to standard out and specifically, if TRUE, plink log will be displayed. |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_hwe) via ggplot2::ggsave(p=p_maf, other_arguments) or pdf(outfile) print(p_maf) dev.off(). |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
Details
check_maf
uses plink –remove name.fail.IDs –freq to calculate the
minor allele frequencies for all variants in the individuals that passed the
perIndividualQC
. It does so without generating a new dataset
but simply removes the IDs when calculating the statistics.
For details on the output data.frame fail_maf, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#frq.
Value
Named list with i) fail_maf containing a [data.frame] with CHR (Chromosome code), SNP (Variant identifier), A1 (Allele 1; usually minor), A2 (Allele 2; usually major), MAF (Allele 1 frequency), NCHROBS (Number of allele observations) for all SNPs that failed the mafTh/macTh and ii) p_maf, a ggplot2-object 'containing' the MAF distribution histogram which can be shown by (print(p_maf)).
Examples
indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data"
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
## Not run:
# run on all individuals and markers
fail_maf <- check_maf(indir=indir, qcdir=qcdir, name=name, macTh=15,
interactive=FALSE, verbose=TRUE, path2plink=path2plink)
# run on subset of individuals and markers
keep_individuals_file <- system.file("extdata", "keep_individuals",
package="plinkQC")
exclude_markers_file <- system.file("extdata", "exclude_markers",
package="plinkQC")
fail_maf <- check_maf(qcdir=qcdir, indir=indir,
name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink,
keep_individuals=keep_individuals_file, exclude_markers=exclude_markers_file)
## End(Not run)
Identification of related individuals
Description
Runs and evaluates results from plink –genome.
plink –genome calculates identity by state (IBS) for each pair of
individuals based on the average proportion of alleles shared at genotyped
SNPs. The degree of recent shared ancestry, i.e. the identity by descent
(IBD) can be estimated from the genome-wide IBS. The proportion of IBD
between two individuals is returned by plink –genome as PI_HAT.
check_relatedness finds pairs of samples whose proportion of IBD is larger
than the specified highIBDTh. Subsequently, for pairs of individuals that do
not have additional relatives in the dataset, the individual with the greater
genotype missingness rate is selected and returned as the individual failing
the relatedness check. For more complex family structures, the unrelated
individuals per family are selected (e.g. in a parents-offspring trio, the
offspring will be marked as fail, while the parents will be kept in the
analysis).
check_relatedness
depicts all pair-wise IBD-estimates as histograms
stratified by value of PI_HAT.
Usage
check_relatedness(
indir,
name,
qcdir = indir,
highIBDTh = 0.1875,
filter_high_ldregion = TRUE,
high_ldregion_file = NULL,
genomebuild = "hg19",
imissTh = 0.03,
run.check_relatedness = TRUE,
interactive = FALSE,
verbose = FALSE,
mafThRelatedness = 0.1,
path2plink = NULL,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam, name.genome and name.imiss. |
qcdir |
[character] /path/to/directory to where name.genome as returned by plink –genome will be saved. Per default qcdir=indir. If run.check_relatedness is FALSE, it is assumed that plink –missing and plink –genome have been run and qcdir/name.imiss and qcdir/name.genome exist. User needs writing permission to qcdir. |
highIBDTh |
[double] Threshold for acceptable proportion of IBD between pair of individuals. |
filter_high_ldregion |
[logical] Should high LD regions be filtered
before IBD estimation; carried out per default with high LD regions for
hg19 provided as default via |
high_ldregion_file |
[character] Path to file with high LD regions used
for filtering before IBD estimation if |
genomebuild |
[character] Name of the genome build of the PLINK file annotations, ie mappings in the name.bim file. Will be used to remove high-LD regions based on the coordinates of the respective build. Options are hg18, hg19 and hg38. See @details. |
imissTh |
[double] Threshold for acceptable missing genotype rate in any individual; has to be proportion between (0,1) |
run.check_relatedness |
[logical] Should plink –genome be run to
determine pairwise IBD of individuals; if FALSE, it is assumed that
plink –genome and plink –missing have been run and qcdir/name.imiss and
qcdir/name.genome are present;
|
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_IBD() via ggplot2::ggsave(p=p_IBD, other_arguments) or pdf(outfile) print(p_IBD) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
mafThRelatedness |
[double] Threshold of minor allele frequency filter for selecting variants for IBD estimation. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Details
check_relatedness
wraps around
run_check_relatedness
and
evaluate_check_relatedness
. If run.check_relatedness is TRUE,
run_check_relatedness
is executed ; otherwise it is assumed that
plink –genome has been run externally and qcdir/name.genome exists.
check_relatedness
will fail with missing file error otherwise.
For details on the output data.frame fail_high_IBD, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#genome.
Value
Named [list] with i) fail_high_IBD containing a [data.frame] of IIDs and FIDs of individuals who fail the IBDTh in columns FID1 and IID1. In addition, the following columns are returned (as originally obtained by plink –genome): FID2 (Family ID for second sample), IID2 (Individual ID for second sample), RT (Relationship type inferred from .fam/.ped file), EZ (IBD sharing expected value, based on just .fam/.ped relationship), Z0 (P(IBD=0)), Z1 (P(IBD=1)), Z2 (P(IBD=2)), PI_HAT (Proportion IBD, i.e. P(IBD=2) + 0.5*P(IBD=1)), PHE (Pairwise phenotypic code (1, 0, -1 = AA, AU, and UU pairs, respectively)), DST (IBS distance, i.e. (IBS2 + 0.5*IBS1) / (IBS0 + IBS1 + IBS2)), PPC (IBS binomial test), RATIO (HETHET : IBS0 SNP ratio (expected value 2)). and ii) failIDs containing a [data.frame] with individual IDs [IID] and family IDs [FID] of individuals failing the highIBDTh iii) p_IBD, a ggplot2-object 'containing' all pair-wise IBD-estimates as histograms stratified by value of PI_HAT, which can be shown by print(p_IBD).
Examples
## Not run:
indir <- system.file("extdata", package="plinkQC")
name <- 'data'
path2plink <- "path/to/plink"
# whole dataset
relatednessQC <- check_relatedness(indir=indir, name=name, interactive=FALSE,
run.check_relatedness=FALSE, path2plink=path2plink)
# subset of dataset
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")
fail_relatedness <- check_relatedness(indir=qcdir, name=name,
remove_individuals=remove_individuals_file, path2plink=path2plink)
## End(Not run)
Identification of individuals with discordant sex information
Description
Runs and evaluates results from plink –check-sex.
check_sex
returns IIDs for individuals whose SNPSEX != PEDSEX
(where the SNPSEX is determined by the heterozygosity rate across
X-chromosomal variants).
Mismatching SNPSEX and PEDSEX IDs can indicate plating errors, sample-mixup
or generally samples with poor genotyping. In the latter case, these IDs are
likely to fail other QC steps as well.
Optionally, an extra data.frame (externalSex) with sample IDs and sex can be
provided to double check if external and PEDSEX data (often processed at
different centers) match. If a mismatch between PEDSEX and SNPSEX was
detected, while SNPSEX == Sex, PEDSEX of these individuals can optionally be
updated (fixMixup=TRUE).
check_sex
depicts the X-chromosomal heterozygosity (SNPSEX) of
the individuals split by their (PEDSEX).
Usage
check_sex(
indir,
name,
qcdir = indir,
maleTh = 0.8,
femaleTh = 0.2,
run.check_sex = TRUE,
externalSex = NULL,
externalFemale = "F",
externalMale = "M",
externalSexSex = "Sex",
externalSexID = "IID",
fixMixup = FALSE,
interactive = FALSE,
verbose = FALSE,
label_fail = TRUE,
highlight_samples = NULL,
highlight_type = c("text", "label", "color", "shape"),
highlight_text_size = 3,
highlight_color = "#c51b8a",
highlight_shape = 17,
highlight_legend = FALSE,
path2plink = NULL,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam and name.sexcheck. |
qcdir |
[character] /path/to/directory to save name.sexcheck as returned by plink –check-sex. Per default qcdir=indir. If run.check_sex is FALSE, it is assumed that plink –check-sex has been run and qcdir/name.sexcheck is present. User needs writing permission to qcdir. |
maleTh |
[double] Threshold of X-chromosomal heterozygosity rate for males. |
femaleTh |
[double] Threshold of X-chromosomal heterozygosity rate for females. |
run.check_sex |
[logical] Should plink –check-sex be run? if set to
FALSE, it is assumed that plink –check-sex has been run and
qcdir/name.sexcheck is present; |
externalSex |
[data.frame, optional] Dataframe with sample IDs [externalSexID] and sex [externalSexSex] to double check if external and PEDSEX data (often processed at different centers) match. |
externalFemale |
[integer/character] Identifier for 'female' in externalSex. |
externalMale |
[integer/character] Identifier for 'male' in externalSex. |
externalSexSex |
[character] Column identifier for column containing sex information in externalSex. |
externalSexID |
[character] Column identifier for column containing ID information in externalSex. |
fixMixup |
[logical] Should PEDSEX of individuals with mismatch between PEDSEX and Sex while Sex==SNPSEX automatically corrected: this will directly change the name.bim/.bed/.fam files! |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_sexcheck) via ggplot2::ggsave(p=p_sexcheck, other_arguments) or pdf(outfile) print(p_sexcheck) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
label_fail |
[logical] Set TRUE, to add fail IDs as text labels in scatter plot. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_sexcheck); all highlight_samples IIDs have to be present in the IIDs of the name.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified. Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Details
check_sex
wraps around run_check_sex
and
evaluate_check_sex
. If run.check_sex is TRUE,
run_check_sex
is executed ; otherwise it is assumed that plink
–check-sex has been run externally and qcdir/name.sexcheck exists.
check_sex
will fail with missing file error otherwise.
For details on the output data.frame fail_sex, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#sexcheck.
Value
Named list with i) fail_sex: [data.frame] with FID, IID, PEDSEX, SNPSEX and Sex (if externalSex was provided) of individuals failing sex check, ii) mixup: dataframe with FID, IID, PEDSEX, SNPSEX and Sex (if externalSex was provided) of individuals whose PEDSEX != Sex and Sex == SNPSEX and iii) p_sexcheck, a ggplot2-object 'containing' a scatter plot of the X-chromosomal heterozygosity (SNPSEX) of the sample split by their (PEDSEX), which can be shown by print(p_sexcheck).
Examples
## Not run:
indir <- system.file("extdata", package="plinkQC")
name <- "data"
# whole dataset
fail_sex <- check_sex(indir=indir, name=name, run.check_sex=FALSE,
interactive=FALSE, verbose=FALSE)
# subset of dataset with sample highlighting
highlight_samples <- read.table(system.file("extdata", "keep_individuals",
package="plinkQC"))
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")
fail_sex <- check_sex(indir=indir, name=name,
interactive=FALSE, path2plink=path2plink,
remove_individuals=remove_individuals_file,
highlight_samples=highlight_samples[,2], highlight_type = c("text", "shape"))
## End(Not run)
Identification of SNPs with high missingness rate
Description
Runs and evaluates results from plink –missing –freq. It calculate the
rates of missing genotype calls and frequency for all variants in the
individuals that passed the perIndividualQC
. The SNP
missingness rates (stratified by minor allele frequency) are depicted as
histograms.
Usage
check_snp_missingness(
indir,
name,
qcdir = indir,
lmissTh = 0.01,
interactive = FALSE,
path2plink = NULL,
verbose = FALSE,
showPlinkOutput = TRUE,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where results will be written to.
If |
lmissTh |
[double] Threshold for acceptable variant missing rate across samples. |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_lmiss) via ggplot2::ggsave(p=p_lmiss, other_arguments) or pdf(outfile) print(p_lmiss) dev.off(). |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
verbose |
[logical] If TRUE, progress info is printed to standard out and specifically, if TRUE, plink log will be displayed. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
Details
check_snp_missingness
uses plink –remove name.fail.IDs –missing
–freq to calculate rates of missing genotype calls and frequency per SNP in
the individuals that passed the perIndividualQC
. It does so
without generating a new dataset but simply removes the IDs when calculating
the statistics.
For details on the output data.frame fail_missingness, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#lmiss.
Value
Named list with i) fail_missingness containing a [data.frame] with CHR (Chromosome code), SNP (Variant identifier), CLST (Cluster identifier. Only present with –within/–family), N_MISS (Number of missing genotype call(s), not counting obligatory missings), N_CLST (Cluster size; does not include nonmales on Ychr; Only present with –within/–family), N_GENO (Number of potentially valid call(s)), F_MISS (Missing call rate) for all SNPs failing the lmissTh and ii) p_lmiss, a ggplot2-object 'containing' the SNP missingness histogram which can be shown by (print(p_lmiss)).
Examples
indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data"
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
## Not run:
# run on all individuals and markers
fail_snp_missingness <- check_snp_missingness(qcdir=qcdir, indir=indir,
name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink)
# run on subset of individuals and markers
keep_individuals_file <- system.file("extdata", "keep_individuals",
package="plinkQC")
extract_markers_file <- system.file("extdata", "extract_markers",
package="plinkQC")
fail_snp_missingness <- check_snp_missingness(qcdir=qcdir, indir=indir,
name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink,
keep_individuals=keep_individuals_file, extract_markers=extract_markers_file)
## End(Not run)
Create plink dataset with individuals and markers passing quality control
Description
Individuals that fail per-individual QC and markers that fail per-marker QC are removed from indir/name.bim/.bed/.fam and a new, dataset with the remaining individuals and markers is created as qcdir/name.clean.bim/.bed/.fam.
Usage
cleanData(
indir,
name,
qcdir = indir,
filterSex = TRUE,
filterHeterozygosity = TRUE,
filterSampleMissingness = TRUE,
filterAncestry = TRUE,
filterRelated = TRUE,
filterSNPMissingness = TRUE,
lmissTh = 0.01,
filterHWE = TRUE,
hweTh = 1e-05,
filterMAF = TRUE,
macTh = 20,
mafTh = NULL,
path2plink = NULL,
verbose = FALSE,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where results will be written to.
If |
filterSex |
[logical] Set to exclude samples that failed the sex
check (via |
filterHeterozygosity |
[logical] Set to exclude samples that failed
check for outlying heterozygosity rates (via
|
filterSampleMissingness |
[logical] Set to exclude samples that failed
check for excessive missing genotype rates (via
|
filterAncestry |
[logical] Set to exclude samples that failed ancestry
check (via |
filterRelated |
[logical] Set to exclude samples that failed relatedness
check (via |
filterSNPMissingness |
[logical] Set to exclude markers that have
excessive missing rates across samples (via
|
lmissTh |
[double] Threshold for acceptable variant missing rate across samples. |
filterHWE |
[logical] Set to exclude markers that fail HWE exact test
(via |
hweTh |
[double] Significance threshold for deviation from HWE. |
filterMAF |
[logical] Set to exclude markers that fail minor allele
frequency or minor allele count threshold (via |
macTh |
[double] Threshold for minor allele cut cut-off, if both mafTh and macTh are specified, macTh is used (macTh = mafTh\*2\*NrSamples). |
mafTh |
[double] Threshold for minor allele frequency cut-off. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Value
names [list] with i) passIDs, containing a [data.frame] with family [FID] and individual [IID] IDs of samples that pass the QC, ii) failIDs, containing a [data.frame] with family [FID] and individual [IID] IDs of samples that fail the QC.
Examples
package.dir <- find.package('plinkQC')
indir <- file.path(package.dir, 'extdata')
qcdir <- tempdir()
name <- "data"
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
## Not run:
# Run qc on all samples and markers in the dataset
## Run individual QC checks
fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name,
refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""),
refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""),
prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE,
path2plink=path2plink)
## Run marker QC checks
fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name,
path2plink=path2plink)
## Create new dataset of individuals and markers passing QC
ids_all <- cleanData(indir=indir, qcdir=qcdir, name=name, macTh=15,
verbose=TRUE, path2plink=path2plink, filterAncestry=FALSE,
filterRelated=TRUE)
# Run qc on subset of samples and markers in the dataset
highlight_samples <- read.table(system.file("extdata", "keep_individuals",
package="plinkQC"))
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")
fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name,
dont.check_ancestry = TRUE, interactive=FALSE, verbose=FALSE,
highlight_samples = highlight_samples[,2], highlight_type = "label",
remove_individuals = remove_individuals_file, path2plink=path2plink)
## Run marker QC checks
fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name,
path2plink=path2plink)
## Create new dataset of individuals and markers passing QC
ids_all <- cleanData(indir=indir, qcdir=qcdir, name=name, macTh=15,
verbose=TRUE, path2plink=path2plink, filterAncestry=FALSE,
remove_individuals = remove_individuals_file)
## End(Not run)
Evaluate results from PLINK PCA on combined study and reference data
Description
Evaluates and depicts results of plink –pca (via
run_check_ancestry
or externally conducted pca) on merged
genotypes from individuals to be QCed and individuals of reference population
of known genotypes. Currently, check ancestry only supports automatic
selection of individuals of European descent. It uses information from
principal components 1 and 2 returned by plink –pca to find the center of
the European reference samples (mean(PC1_europeanRef), mean(PC2_europeanRef).
It computes the maximum Euclidean distance (maxDist) of the European
reference samples from this centre. All study samples whose Euclidean
distance from the centre falls outside the circle described by the radius
r=europeanTh* maxDist are considered non-European and their IDs are returned
as failing the ancestry check.
check_ancestry creates a scatter plot of PC1 versus PC2 colour-coded for
samples of the reference populations and the study population.
Usage
evaluate_check_ancestry(
indir,
name,
prefixMergedDataset,
qcdir = indir,
europeanTh = 1.5,
defaultRefSamples = c("HapMap", "1000Genomes"),
refSamples = NULL,
refColors = NULL,
refSamplesFile = NULL,
refColorsFile = NULL,
refSamplesIID = "IID",
refSamplesPop = "Pop",
refColorsColor = "Color",
refColorsPop = "Pop",
studyColor = "#2c7bb6",
refPopulation = c("CEU", "TSI"),
legend_labels_per_row = 6,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
highlight_samples = NULL,
highlight_type = c("text", "label", "color", "shape"),
highlight_text_size = 3,
highlight_color = "#c51b8a",
highlight_shape = 17,
highlight_legend = FALSE,
interactive = FALSE,
verbose = FALSE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
prefixMergedDataset |
[character] Prefix of merged dataset (study and reference samples) used in plink –pca, resulting in prefixMergedDataset.eigenvec. |
qcdir |
[character] /path/to/directory/with/QC/results containing prefixMergedDataset.eigenvec results as returned by plink –pca. Per default qcdir=indir. |
europeanTh |
[double] Scaling factor of radius to be drawn around center of European reference samples, with study samples inside this radius considered to be of European descent and samples outside this radius of non-European descent. The radius is computed as the maximum Euclidean distance of European reference samples to the centre of European reference samples. |
defaultRefSamples |
[character] Option to use pre-downloaded individual and population identifiers from either the 1000Genomes or HapMap project. If refSamples and refSamplesFile are not provided, the HapMap identifiers (or 1000Genomes is specified) will be used as default and the function will fail if the reference samples in the prefixMergedDataset do not match these reference samples. If refColors and refColorsFile are not provided, this also sets default colors for the reference populations. |
refSamples |
[data.frame] Dataframe with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. If refSamples and refSamplesFile are not specified, defaultRefSamples will be used as reference. |
refColors |
[data.frame, optional] Dataframe with population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If refColors and refColorsFile are not specified and refSamples and refSamplesFile are not specified, default colors will be determined from the defaultRefSamples option. If refColors and refColorsFile are not specified and but refSamples or refSamplesFile are given, ggplot default colors will be used. |
refSamplesFile |
[character] /path/to/File/with/reference samples. Needs columns with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. If both refSamplesFile and refSamples are not NULL, defaultRefSamples information is used. |
refColorsFile |
[character, optional] /path/to/File/with/Population/Colors containing population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If refColors and refColorsFile are not specified and refSamples and refSamplesFile are not specified, default colors will be determined from the defaultRefSamples option. If refColors and refColorsFile are not specified and but refSamples or refSamplesFile are given, ggplot default colors will be used. |
refSamplesIID |
[character] Column name of reference sample IDs in refSamples/refSamplesFile. |
refSamplesPop |
[character] Column name of reference sample population IDs in refSamples/refSamplesFile. |
refColorsColor |
[character] Column name of population colors in refColors/refColorsFile |
refColorsPop |
[character] Column name of reference sample population IDs in refColors/refColorsFile. |
studyColor |
[character] Colour to be used for study population if plot is TRUE. |
refPopulation |
[vector] Vector with population identifiers of European reference population. Identifiers have to be corresponding to population IDs [refColorsPop] in refColorsfile/refColors. |
legend_labels_per_row |
[integer] Number of population names per row in PCA plot. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_ancestry); all highlight_samples IIDs have to be present in the IIDs of the prefixMergedDataset.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified.Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_ancestry) via ggplot2::ggsave(p=p_ancestry, other_arguments) or pdf(outfile) print(p_ancestry) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
Details
Both run_check_ancestry
and
evaluate_check_ancestry
can simply be invoked by
check_ancestry
.
1000 Genomes samples were downloaded from https://www.internationalgenome.org/category/sample/, HapMap Phase 3 samples were downloaded from https://www.broadinstitute.org/medical-and-population-genetics/hapmap-3.
Value
Named [list] with i) fail_ancestry, containing a [data.frame] with FID and IID of non-European individuals and ii) p_ancestry, a ggplot2-object 'containing' a scatter plot of PC1 versus PC2 colour-coded for samples of the reference populations and the study population, which can be shown by print(p_ancestry) and iii) plot_data, a data.frame with the data visualised in p_ancestry (ii).
Examples
## Not run:
qcdir <- system.file("extdata", package="plinkQC")
name <- "data"
# whole dataset
fail_ancestry <- evaluate_check_ancestry(indir=qcdir, name=name,
refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""),
refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""),
prefixMergedDataset="data.HapMapIII", interactive=FALSE)
# highlight samples
highlight_samples <- read.table(system.file("extdata", "keep_individuals",
package="plinkQC"))
fail_ancestry <- evaluate_check_ancestry(indir=qcdir, name=name,
refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""),
refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""),
prefixMergedDataset="data.HapMapIII", interactive=FALSE,
highlight_samples = highlight_samples[,2],
highlight_type = c("text", "shape"))
## End(Not run)
Evaluate results from PLINK missing genotype and heterozygosity rate check.
Description
Evaluates and depicts results from plink –missing (missing genotype rates
per individual) and plink –het (heterozygosity rates per individual) via
run_check_heterozygosity
and
run_check_missingness
or externally conducted check.)
Non-systematic failures in genotyping and outlying heterozygosity (hz) rates
per individual are often proxies for DNA sample quality. Larger than expected
heterozygosity can indicate possible DNA contamination.
The mean heterozygosity in PLINK is computed as hz_mean = (N-O)/N, where
N: number of non-missing genotypes and O:observed number of homozygous
genotypes for a given individual.
Mean heterozygosity can differ between populations and SNP genotyping panels.
Within a population and genotyping panel, a reduced heterozygosity rate can
indicate inbreeding - these individuals will then be returned by
check_relatedness as individuals that fail the relatedness filters.
evaluate_check_het_and_miss
creates a scatter plot with the
individuals' missingness rates on x-axis and their heterozygosity rates on
the y-axis.
Usage
evaluate_check_het_and_miss(
qcdir,
name,
imissTh = 0.03,
hetTh = 3,
label_fail = TRUE,
highlight_samples = NULL,
highlight_type = c("text", "label", "color", "shape"),
highlight_text_size = 3,
highlight_color = "#c51b8a",
highlight_shape = 17,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
highlight_legend = FALSE,
interactive = FALSE
)
Arguments
qcdir |
[character] path/to/directory/with/QC/results containing name.imiss and name.het results as returned by plink –missing and plink –het. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam, name.het and name.imiss. |
imissTh |
[double] Threshold for acceptable missing genotype rate in any individual; has to be proportion between (0,1) |
hetTh |
[double] Threshold for acceptable deviation from mean heterozygosity in any individual. Expressed as multiples of standard deviation of heterozygosity (het), i.e. individuals outside mean(het) +/- hetTh*sd(het) will be returned as failing heterozygosity check; has to be larger than 0. |
label_fail |
[logical] Set TRUE, to add fail IDs as text labels in scatter plot. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_het_imiss); all highlight_samples IIDs have to be present in the IIDs of the name.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified.Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_het_imiss) via ggplot2::ggsave(p=p_het_imiss , other_arguments) or pdf(outfile) print(p_het_imiss) dev.off(). |
Details
All, run_check_heterozygosity
,
run_check_missingness
and
evaluate_check_het_and_miss
can simply be invoked by
check_het_and_miss
.
For details on the output data.frame fail_imiss and fail_het, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#imiss and https://www.cog-genomics.org/plink/1.9/formats#het
Value
named [list] with i) fail_imiss dataframe containing FID (Family ID), IID (Within-family ID), MISS_PHENO (Phenotype missing? (Y/N)), N_MISS (Number of missing genotype call(s), not including obligatory missings), N_GENO ( Number of potentially valid call(s)), F_MISS (Missing call rate) of individuals failing missing genotype check and ii) fail_het dataframe containing FID (Family ID), IID (Within-family ID), O(HOM) (Observed number of homozygotes), E(HOM) (Expected number of homozygotes), N(NM) (Number of non-missing autosomal genotypes), F (Method-of-moments F coefficient estimate) of individuals failing outlying heterozygosity check; iii) p_het_imiss, a ggplot2-object 'containing' a scatter plot with the samples' missingness rates on x-axis and their heterozygosity rates on the y-axis, which can be shown by print(p_het_imiss) and iv) plot_data, a data.frame with the data visualised in p_het_imiss (iii).
Examples
qcdir <- system.file("extdata", package="plinkQC")
name <- "data"
## Not run:
fail_het_miss <- evaluate_check_het_and_miss(qcdir=qcdir, name=name,
interactive=FALSE)
#' # highlight samples
highlight_samples <- read.table(system.file("extdata", "keep_individuals",
package="plinkQC"))
fail_het_miss <- evaluate_check_het_and_miss(qcdir=qcdir, name=name,
interactive=FALSE, highlight_samples = highlight_samples[,2],
highlight_type = c("text", "color"))
## End(Not run)
Evaluate results from PLINK IBD estimation.
Description
Evaluates and depicts results from plink –genome on the LD pruned dataset
(via run_check_relatedness
or externally conducted IBD
estimation). plink –genome calculates identity by state (IBS) for each pair
of individuals based on the average proportion of alleles shared at genotyped
SNPs. The degree of recent shared ancestry, i.e. the identity by descent
(IBD) can be estimated from the genome-wide IBS. The proportion of IBD
between two individuals is returned by –genome as PI_HAT.
evaluate_check_relatedness
finds pairs of samples whose proportion of
IBD is larger than the specified highIBDTh. Subsequently, for pairs of
individual that do not have additional relatives in the dataset, the
individual with the greater genotype missingness rate is selected and
returned as the individual failing the relatedness check. For more complex
family structures, the unrelated individuals per family are selected (e.g. in
a parents-offspring trio, the offspring will be marked as fail, while the
parents will be kept in the analysis).
evaluate_check_relatedness
depicts all pair-wise IBD-estimates as
histograms stratified by value of PI_HAT.
Usage
evaluate_check_relatedness(
qcdir,
name,
highIBDTh = 0.1875,
imissTh = 0.03,
interactive = FALSE,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
verbose = FALSE
)
Arguments
qcdir |
[character] path/to/directory/with/QC/results containing name.imiss and name.genome results as returned by plink –missing and plink –genome. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam, name.genome and name.imiss. |
highIBDTh |
[double] Threshold for acceptable proportion of IBD between pair of individuals. |
imissTh |
[double] Threshold for acceptable missing genotype rate in any individual; has to be proportion between (0,1) |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_IBD() via ggplot2::ggsave(p=p_IBD, other_arguments) or pdf(outfile) print(p_IBD) dev.off(). |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
Details
Both run_check_relatedness
and
evaluate_check_relatedness
can simply be invoked by
check_relatedness
.
For details on the output data.frame fail_high_IBD, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#genome.
Value
a named [list] with i) fail_high_IBD containing a [data.frame] of IIDs and FIDs of individuals who fail the IBDTh in columns FID1 and IID1. In addition, the following columns are returned (as originally obtained by plink –genome): FID2 (Family ID for second sample), IID2 (Individual ID for second sample), RT (Relationship type inferred from .fam/.ped file), EZ (IBD sharing expected value, based on just .fam/.ped relationship), Z0 (P(IBD=0)), Z1 (P(IBD=1)), Z2 (P(IBD=2)), PI_HAT (Proportion IBD, i.e. P(IBD=2) + 0.5*P(IBD=1)), PHE (Pairwise phenotypic code (1, 0, -1 = AA, AU, and UU pairs, respectively)), DST (IBS distance, i.e. (IBS2 + 0.5*IBS1) / (IBS0 + IBS1 + IBS2)), PPC (IBS binomial test), RATIO (HETHET : IBS0 SNP ratio (expected value 2)). and ii) failIDs containing a [data.frame] with individual IDs [IID] and family IDs [FID] of individuals failing the highIBDTh; iii) p_IBD, a ggplot2-object 'containing' all pair-wise IBD-estimates as histograms stratified by value of PI_HAT, which can be shown by print(p_IBD and iv) plot_data, a data.frame with the data visualised in p_IBD (iii).
Examples
qcdir <- system.file("extdata", package="plinkQC")
name <- 'data'
## Not run:
relatednessQC <- evaluate_check_relatedness(qcdir=qcdir, name=name,
interactive=FALSE)
## End(Not run)
Evaluate results from PLINK sex check.
Description
Evaluates and depicts results from plink –check-sex (via
run_check_sex
or externally conducted sex check).
Takes file qcdir/name.sexcheck and returns IIDs for samples whose
SNPSEX != PEDSEX (where the SNPSEX is determined by the heterozygosity rate
across X-chromosomal variants).
Mismatching SNPSEX and PEDSEX IDs can indicate plating errors, sample-mixup
or generally samples with poor genotyping. In the latter case, these IDs are
likely to fail other QC steps as well.
Optionally, an extra data.frame (externalSex) with sample IDs and sex can be
provided to double check if external and PEDSEX data (often processed at
different centers) match. If a mismatch between PEDSEX and SNPSEX was
detected while SNPSEX == Sex, PEDSEX of these individuals can optionally be
updated (fixMixup=TRUE).
evaluate_check_sex
depicts the X-chromosomal heterozygosity (SNPSEX)
of the samples split by their (PEDSEX).
Usage
evaluate_check_sex(
qcdir,
name,
maleTh = 0.8,
femaleTh = 0.2,
externalSex = NULL,
fixMixup = FALSE,
indir = qcdir,
externalFemale = "F",
externalMale = "M",
externalSexSex = "Sex",
externalSexID = "IID",
verbose = FALSE,
label_fail = TRUE,
highlight_samples = NULL,
highlight_type = c("text", "label", "color", "shape"),
highlight_text_size = 3,
highlight_color = "#c51b8a",
highlight_shape = 17,
highlight_legend = FALSE,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
path2plink = NULL,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
showPlinkOutput = TRUE,
interactive = FALSE
)
Arguments
qcdir |
[character] /path/to/directory containing name.sexcheck as returned by plink –check-sex. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam and name.sexcheck. |
maleTh |
[double] Threshold of X-chromosomal heterozygosity rate for males. |
femaleTh |
[double] Threshold of X-chromosomal heterozygosity rate for females. |
externalSex |
[data.frame, optional] with sample IDs [externalSexID] and sex [externalSexSex] to double check if external and PEDSEX data (often processed at different centers) match. |
fixMixup |
[logical] Should PEDSEX of individuals with mismatch between PEDSEX and Sex, with Sex==SNPSEX automatically corrected: this will directly change the name.bim/.bed/.fam files! |
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files; only required of fixMixup==TRUE. User needs writing permission to indir. |
externalFemale |
[integer/character] Identifier for 'female' in externalSex. |
externalMale |
[integer/character] Identifier for 'male' in externalSex. |
externalSexSex |
[character] Column identifier for column containing sex information in externalSex. |
externalSexID |
[character] Column identifier for column containing ID information in externalSex. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
label_fail |
[logical] Set TRUE, to add fail IDs as text labels in scatter plot. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_sexcheck); all highlight_samples IIDs have to be present in the IIDs of the name.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified. Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_sexcheck) via ggplot2::ggsave(p=p_sexcheck, other_arguments) or pdf(outfile) print(p_sexcheck) dev.off(). |
Details
Both run_check_sex
and evaluate_check_sex
can
simply be invoked by check_sex
.
For details on the output data.frame fail_sex, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#sexcheck.
Value
named list with i) fail_sex: dataframe with FID, IID, PEDSEX, SNPSEX and Sex (if externalSex was provided) of individuals failing sex check; ii) mixup: dataframe with FID, IID, PEDSEX, SNPSEX and Sex (if externalSex was provided) of individuals whose PEDSEX != Sex and Sex == SNPSEX; iii) p_sexcheck, a ggplot2-object 'containing' a scatter plot of the X-chromosomal heterozygosity (SNPSEX) of the individuals split by their (PEDSEX), which can be shown by print(p_sexcheck) and iv) plot_data, a data.frame with the data visualised in p_sexcheck (iii).
Examples
qcdir <- system.file("extdata", package="plinkQC")
name <- "data"
path2plink <- '/path/to/plink'
## Not run:
fail_sex <- evaluate_check_sex(qcdir=qcdir, name=name, interactive=FALSE,
verbose=FALSE, path2plink=path2plink)
# highlight samples
highlight_samples <- read.table(system.file("extdata", "keep_individuals",
package="plinkQC"))
fail_sex <- evaluate_check_sex(qcdir=qcdir, name=name, interactive=FALSE,
verbose=FALSE, path2plink=path2plink,
highlight_samples = highlight_samples[,2],
highlight_type = c("label", "color"), highlight_color = "darkgreen")
## End(Not run)
Overview of per sample QC
Description
overviewPerIndividualQC
depicts results of
perIndividualQC
as intersection plots
(via upset
) and returns dataframes indicating
which QC checks individuals failed or passed.
Usage
overviewPerIndividualQC(results_perIndividualQC, interactive = FALSE)
Arguments
results_perIndividualQC |
[list] Output of |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_overview) via ggplot2::ggsave(p=p_overview, other_arguments) or pdf(outfile) print(p_overview) dev.off(). |
Value
Named [list] with i) nr_fail_samples: total number of samples [integer] failing perIndividualQC, ii) fail_QC containing a [data.frame] with samples that failed QC steps (excluding ancestry) with IID, FID, all QC steps applied by perIndividualQC (max=4), with entries=0 if passing the QC and entries=1 if failing that particular QC and iii) fail_QC_and_ancestry containing a [data.frame] with samples that failed ancestry and QC checks with IID, FID, QC_fail and Ancestry_fail, with entries=0 if passing and entries=1 if failing that check, iii) p_overview, a ggplot2-object 'containing' a sub-paneled plot with the QC-plots.
Examples
indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data"
## Not run:
fail_individuals <- perIndividualQC(qcdir=qcdir, indir=indir, name=name,
refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""),
refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""),
prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE,
do.run_check_het_and_miss=FALSE, do.run_check_relatedness=FALSE,
do.run_check_sex=FALSE, do.run_check_ancestry=FALSE)
overview <- overviewPerIndividualQC(fail_individuals)
## End(Not run)
Overview of per marker QC
Description
overviewPerMarkerQC
depicts results of perMarkerQC
as
an intersection plot (via upset
) and returns a
dataframe indicating which QC checks were failed or passed.
Usage
overviewPerMarkerQC(results_perMarkerQC, interactive = FALSE)
Arguments
results_perMarkerQC |
[list] Output of |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_overview) via ggplot2::ggsave(p=p_overview, other_arguments) or pdf(outfile) print(p_overview) dev.off(). |
Value
Named [list] with i) nr_fail_markers: total number of markers [integer] failing perMarkerQC, ii) fail_QC containing a [data.frame] with markers that failed QC steps: marker rsIDs in rows, columns are all QC steps applied by perMarkerQC (max=3), with entries=0 if passing the QC and entries=1 if failing that particular QC.
Examples
indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data"
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
# All quality control checks
## Not run:
fail_markers <- perMarkerQC(qcdir=qcdir, indir=indir, name=name,
interactive=FALSE, verbose=TRUE, path2plink=path2plink)
overview <- overviewPerMarkerQC(fail_markers)
## End(Not run)
Quality control for all individuals in plink-dataset
Description
perIndividualQC checks the samples in the plink dataset for their total missingness and heterozygosity rates, the concordance of their assigned sex to their SNP sex, their relatedness to other study individuals and their genetic ancestry.
Usage
perIndividualQC(
indir,
name,
qcdir = indir,
dont.check_sex = FALSE,
do.run_check_sex = TRUE,
do.evaluate_check_sex = TRUE,
maleTh = 0.8,
femaleTh = 0.2,
externalSex = NULL,
externalMale = "M",
externalSexSex = "Sex",
externalSexID = "IID",
externalFemale = "F",
fixMixup = FALSE,
dont.check_het_and_miss = FALSE,
do.run_check_het_and_miss = TRUE,
do.evaluate_check_het_and_miss = TRUE,
imissTh = 0.03,
hetTh = 3,
dont.check_relatedness = FALSE,
do.run_check_relatedness = TRUE,
do.evaluate_check_relatedness = TRUE,
highIBDTh = 0.1875,
mafThRelatedness = 0.1,
filter_high_ldregion = TRUE,
high_ldregion_file = NULL,
genomebuild = "hg19",
dont.check_ancestry = FALSE,
do.run_check_ancestry = TRUE,
do.evaluate_check_ancestry = TRUE,
prefixMergedDataset,
europeanTh = 1.5,
defaultRefSamples = c("HapMap", "1000Genomes"),
refSamples = NULL,
refColors = NULL,
refSamplesFile = NULL,
refColorsFile = NULL,
refSamplesIID = "IID",
refSamplesPop = "Pop",
refColorsColor = "Color",
refColorsPop = "Pop",
studyColor = "#2c7bb6",
label_fail = TRUE,
highlight_samples = NULL,
highlight_type = c("text", "label", "color", "shape"),
highlight_text_size = 3,
highlight_color = "#c51b8a",
highlight_shape = 17,
highlight_legend = FALSE,
interactive = FALSE,
verbose = TRUE,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
subplot_label_size = 9,
title_size = 9,
path2plink = NULL,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where results will be saved.
Per default, qcdir=indir. If do.evaluate_[analysis] is set to TRUE and
do.run_[analysis] is FALSE, |
dont.check_sex |
[logical] If TRUE, no sex check will be conducted; short for do.run_check_sex=FALSE and do.evaluate_check_sex=FALSE. Takes precedence over do.run_check_sex and do.evaluate_check_sex. |
do.run_check_sex |
[logical] If TRUE, run |
do.evaluate_check_sex |
[logical] If TRUE, run
|
maleTh |
[double] Threshold of X-chromosomal heterozygosity rate for males. |
femaleTh |
[double] Threshold of X-chromosomal heterozygosity rate for females. |
externalSex |
[data.frame, optional] Dataframe with sample IDs [externalSexID] and sex [externalSexSex] to double check if external and PEDSEX data (often processed at different centers) match. |
externalMale |
[integer/character] Identifier for 'male' in externalSex. |
externalSexSex |
[character] Column identifier for column containing sex information in externalSex. |
externalSexID |
[character] Column identifier for column containing ID information in externalSex. |
externalFemale |
[integer/character] Identifier for 'female' in externalSex. |
fixMixup |
[logical] Should PEDSEX of individuals with mismatch between PEDSEX and Sex while Sex==SNPSEX automatically corrected: this will directly change the name.bim/.bed/.fam files! |
dont.check_het_and_miss |
[logical] If TRUE, no heterozygosity and missingness check will be conducted; short for do.run_check_heterozygosity=FALSE, do.run_check_missingness=FALSE and do.evaluate_check_het_and_miss=FALSE. Takes precedence over do.run_check_heterozygosity, do.run_check_missingness and do.evaluate_check_het_and_miss. |
do.run_check_het_and_miss |
[logical] If TRUE, run
|
do.evaluate_check_het_and_miss |
[logical] If TRUE, run
|
imissTh |
[double] Threshold for acceptable missing genotype rate in any individual; has to be proportion between (0,1) |
hetTh |
[double] Threshold for acceptable deviation from mean heterozygosity per individual. Expressed as multiples of standard deviation of heterozygosity (het), i.e. individuals outside mean(het) +/- hetTh*sd(het) will be returned as failing heterozygosity check; has to be larger than 0. |
dont.check_relatedness |
[logical] If TRUE, no relatedness check will be conducted; short for do.run_check_relatedness=FALSE and do.evaluate_check_relatedness=FALSE. Takes precedence over do.run_check_relatedness and do.evaluate_check_relatedness. |
do.run_check_relatedness |
[logical] If TRUE, run
|
do.evaluate_check_relatedness |
[logical] If TRUE, run
|
highIBDTh |
[double] Threshold for acceptable proportion of IBD between pair of individuals. |
mafThRelatedness |
[double] Threshold of minor allele frequency filter for selecting variants for IBD estimation. |
filter_high_ldregion |
[logical] Should high LD regions be filtered
before IBD estimation; carried out per default with high LD regions for
hg19 provided as default via |
high_ldregion_file |
[character] Path to file with high LD regions used
for filtering before IBD estimation if |
genomebuild |
[character] Name of the genome build of the PLINK file annotations, ie mappings in the name.bim file. Will be used to remove high-LD regions based on the coordinates of the respective build. Options are hg18, hg19 and hg38. See @details. |
dont.check_ancestry |
[logical] If TRUE, no ancestry check will be conducted; short for do.run_check_ancestry=FALSE and do.evaluate_check_ancestry=FALSE. Takes precedence over do.run_check_ancestry and do.evaluate_check_ancestry. |
do.run_check_ancestry |
[logical] If TRUE, run
|
do.evaluate_check_ancestry |
[logical] If TRUE, run
|
prefixMergedDataset |
[character] Prefix of merged dataset (study and reference samples) used in plink –pca, resulting in prefixMergedDataset.eigenvec. |
europeanTh |
[double] Scaling factor of radius to be drawn around center of European reference samples, with study samples inside this radius considered to be of European descent and samples outside this radius of non-European descent. The radius is computed as the maximum Euclidean distance of European reference samples to the centre of European reference samples. |
defaultRefSamples |
[character] Option to use pre-downloaded individual and population identifiers from either the 1000Genomes or HapMap project. If refSamples and refSamplesFile are not provided, the HapMap identifiers (or 1000Genomes is specified) will be used as default and the function will fail if the reference samples in the prefixMergedDataset do not match these reference samples. If refColors and refColorsFile are not provided, this also sets default colors for the reference populations. |
refSamples |
[data.frame] Dataframe with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. Either refSamples or refSamplesFile have to be specified. |
refColors |
[data.frame, optional] Dataframe with population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If not provided and is.null(refColorsFile) default colors are used. |
refSamplesFile |
[character] /path/to/File/with/reference samples. Needs columns with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsfile/refColors. |
refColorsFile |
[character, optional] /path/to/File/with/Population/Colors containing population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor].If not provided and is.null(refColors) default colors for are used. |
refSamplesIID |
[character] Column name of reference sample IDs in refSamples/refSamplesFile. |
refSamplesPop |
[character] Column name of reference sample population IDs in refSamples/refSamplesFile. |
refColorsColor |
[character] Column name of population colors in refColors/refColorsFile |
refColorsPop |
[character] Column name of reference sample population IDs in refColors/refColorsFile. |
studyColor |
[character] Colour to be used for study population. |
label_fail |
[logical] Set TRUE, to add fail IDs as text labels in scatter plot. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_sexcheck); all highlight_samples IIDs have to be present in the IIDs of the name.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified. Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_sampleQC) via ggplot2::ggsave(p=p_sampleQC, other_arguments) or pdf(outfile) print(p_sampleQC) dev.off(). If TRUE, i) depicts the X-chromosomal heterozygosity (SNPSEX) of the samples split by their PEDSEX (if do.evaluate_check_sex is TRUE), ii) creates a scatter plot with samples' missingness rates on x-axis and their heterozygosity rates on the y-axis (if do.evaluate_check_het_and_miss is TRUE), iii) depicts all pair-wise IBD-estimates as histogram (if do.evaluate_check_relatedness is TRUE) and iv) creates a scatter plot of PC1 versus PC2 color-coded for samples of reference populations and study population (if do.check_ancestry is set to TRUE). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
subplot_label_size |
[integer] Size of the subplot labeling. |
title_size |
[integer] Size for plot title. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Details
perIndividualQC wraps around the individual QC functions
check_sex
, check_het_and_miss
,
check_relatedness
and check_ancestry
. For details
on the parameters and outputs, check these function documentations. For
detailed output for fail IIDs (instead of simple IID lists), run each
function individually.
Value
Named [list] with i) fail_list, a named [list] with 1.
sample_missingness containing a [vector] with sample IIDs failing the
missingness threshold imissTh, 2. highIBD containing a [vector] with sample
IIDs failing the relatedness threshold highIBDTh, 3. outlying_heterozygosity
containing a [vector] with sample IIDs failing the heterozygosity threshold
hetTh, 4. mismatched_sex containing a [vector] with the sample IIDs failing
the sexcheck based on SNPSEX and femaleTh/maleTh and 5. ancestry containing
a vector with sample IIDs failing the ancestry check based on europeanTh and
ii) p_sampleQC, a ggplot2-object 'containing' a sub-paneled plot with the
QC-plots of check_sex
,
check_het_and_miss
,
check_relatedness
and check_ancestry
, which can
be shown by print(p_sampleQC).
List entries contain NULL if that specific check was not chosen.
Examples
indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data"
# All quality control checks
## Not run:
# whole dataset
fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name,
refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""),
refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""),
prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE,
do.run_check_het_and_miss=FALSE, do.run_check_relatedness=FALSE,
do.run_check_sex=FALSE, do.run_check_ancestry=FALSE)
# Only check sex and missingness/heterozygosity
fail_sex_het_miss <- perIndividualQC(indir=indir, qcdir=qcdir, name=name,
dont.check_ancestry=TRUE, dont.check_relatedness=TRUE,
interactive=FALSE, verbose=FALSE)
# subset of dataset with sample highlighting
highlight_samples <- read.table(system.file("extdata", "keep_individuals",
package="plinkQC"))
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")
individual_qc <- perIndividualQC(indir=indir, qcdir=qcdir, name=name,
refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""),
refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""),
prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE,
do.run_check_ancestry=FALSE, do.evaluate_check_ancestry=TRUE,
path2plink=path2plink,
remove_individuals=remove_individuals_file,
highlight_samples=highlight_samples[,2],
highlight_type = c("text", "color"), highlight_color="goldenrod")
## End(Not run)
Quality control for all markers in plink-dataset
Description
perMarkerQC checks the markers in the plink dataset for their missingness
rates across samples, their deviation from Hardy-Weinberg-Equilibrium (HWE)
and their minor allele frequencies (MAF). Per default, it assumes that IDs of
individuals that have failed perIndividualQC
have been written
to qcdir/name.fail.IDs and removes these individuals when computing
missingness rates, HWE p-values and MAF. If the qcdir/name.fail.IDs file does
not exist, a message is written to stdout but the analyses will continue for
all samples in the name.fam/name.bed/name.bim dataset.
Depicts i) SNP missingness rates (stratified by minor allele
frequency) as histograms, ii) p-values of HWE exact test (stratified by all
and low p-values) as histograms and iii) the minor allele frequency
distribution as a histogram.
Usage
perMarkerQC(
indir,
qcdir = indir,
name,
do.check_snp_missingness = TRUE,
lmissTh = 0.01,
do.check_hwe = TRUE,
hweTh = 1e-05,
do.check_maf = TRUE,
macTh = 20,
mafTh = NULL,
interactive = FALSE,
verbose = TRUE,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
subplot_label_size = 9,
path2plink = NULL,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
qcdir |
[character] /path/to/directory where results will be written to.
If |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
do.check_snp_missingness |
[logical] If TRUE, run
|
lmissTh |
[double] Threshold for acceptable variant missing rate across samples. |
do.check_hwe |
[logical] If TRUE, run |
hweTh |
[double] Significance threshold for deviation from HWE. |
do.check_maf |
[logical] If TRUE, run |
macTh |
[double] Threshold for minor allele cut cut-off, if both mafTh and macTh are specified, macTh is used (macTh = mafTh\*2\*NrSamples). |
mafTh |
[double] Threshold for minor allele frequency cut-off. |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_marker) via ggplot2::ggsave(p=p_marker, other_arguments) or pdf(outfile) print(p_marker) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
subplot_label_size |
[integer] Size of the subplot labeling. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Details
perMarkerQC wraps around the marker QC functions
check_snp_missingness
, check_hwe
and
check_maf
. For details on the parameters and outputs, check
these function documentations.
Value
Named [list] with i) fail_list, a named [list] with 1.
SNP_missingness, containing SNP IDs [vector] failing the missingness
threshold lmissTh, 2. hwe, containing SNP IDs [vector] failing the HWE exact
test threshold hweTh and 3. maf, containing SNPs Ids [vector] failing the MAF
threshold mafTh/MAC threshold macTh and ii) p_markerQC, a ggplot2-object
'containing' a sub-paneled plot with the QC-plots of
check_snp_missingness
, check_hwe
and
check_maf
, which can be shown by print(p_markerQC).
List entries contain NULL if that specific check was not chosen.
Examples
indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data"
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
# All quality control checks
## Not run:
# run on all markers and individuals
fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name,
interactive=FALSE, verbose=TRUE, path2plink=path2plink)
# run on subset of individuals and markers
keep_individuals_file <- system.file("extdata", "keep_individuals",
package="plinkQC")
extract_markers_file <- system.file("extdata", "extract_markers",
package="plinkQC")
fail_markers <- perMarkerQC(qcdir=qcdir, indir=indir,
name=name, interactive=FALSE, verbose=TRUE, path2plink=path2plink,
keep_individuals=keep_individuals_file, extract_markers=extract_markers_file)
## End(Not run)
plinkQC
Description
Genotype Quality Control with 'PLINK'
Remove related individuals while keeping maximum number of individuals
Description
relatednessFilter
takes a data.frame with pair-wise relatedness
measures of samples and returns pairs of individual IDs that are related as
well as a list of suggested individual IDs to remove.
relatednessFilter
finds pairs of samples whose relatedness estimate
is larger than the specified relatednessTh. Subsequently, for pairs of
individual that do not have additional relatives in the dataset, the
individual with the worse otherCriterionMeasure (if provided) or arbitrarily
individual 1 of that pair is selected and returned as the individual failing
the relatedness check. For more complex family structures, the unrelated
individuals per family are selected (e.g. in a simple case of a
parents-offspring trio, the offspring will be marked as fail, while the
parents will be kept in the analysis). Selection is achieved by constructing
subgraphs of clusters of individuals that are related.
relatednessFilter
then finds the maximum independent set of vertices
in the subgraphs of related individuals. If all individuals are related (i.e.
all maximum independent sets are 0), one individual of that cluster will be
kept and all others listed as failIDs.
Usage
relatednessFilter(
relatedness,
otherCriterion = NULL,
relatednessTh,
otherCriterionTh = NULL,
otherCriterionThDirection = c("gt", "ge", "lt", "le", "eq"),
relatednessIID1 = "IID1",
relatednessIID2 = "IID2",
relatednessFID1 = NULL,
relatednessFID2 = NULL,
relatednessRelatedness = "PI_HAT",
otherCriterionIID = "IID",
otherCriterionMeasure = NULL,
verbose = FALSE
)
Arguments
relatedness |
[data.frame] containing pair-wise relatedness estimates (in column [relatednessRelatedness]) for individual 1 (in column [relatednessIID1] and individual 2 (in column [relatednessIID1]). Columns relatednessIID1, relatednessIID2 and relatednessRelatedness have to present, while additional columns such as family IDs can be present. Default column names correspond to column names in output of plink –genome (https://www.cog-genomics.org/plink/1.9/ibd). All original columns for pair-wise highIBDTh fails will be returned in fail_IBD. |
otherCriterion |
[data.frame] containing a QC measure (in column [otherCriterionMeasure]) per individual (in column [otherCriterionIID]). otherCriterionMeasure and otherCriterionIID have to present, while additional columns such as family IDs can be present. IIDs in relatednessIID1 have to be present in otherCriterionIID. |
relatednessTh |
[double] Threshold for filtering related individuals. Individuals, whose pair-wise relatedness estimates are greater than this threshold are considered related. |
otherCriterionTh |
[double] Threshold for filtering individuals based on otherCriterionMeasure. If related individuals fail this threshold they will automatically be excluded. |
otherCriterionThDirection |
[character] Used to determine the direction for failing the otherCriterionTh. If 'gt', individuals whose otherCriterionMeasure > otherCriterionTh will automatically be excluded. For pairs of individuals that have no other related samples in the cohort: if both otherCriterionMeasure < otherCriterionTh, the individual with the larger otherCriterionMeasure will be excluded. |
relatednessIID1 |
[character] Column name of column containing the IDs of the first individual. |
relatednessIID2 |
[character] Column name of column containing the IDs of the second individual. |
relatednessFID1 |
[character, optional] Column name of column containing the family IDs of the first individual; if only relatednessFID1 but not relatednessFID2 provided, or none provided even though present in relatedness, FIDs will not be returned. |
relatednessFID2 |
[character, optional] Column name of column containing the family IDs of the second individual; if only relatednessFID2 but not relatednessFID1 provided, or none provided even though present in relatedness, FIDs will not be returned. |
relatednessRelatedness |
[character] Column name of column containing the relatedness estimate. |
otherCriterionIID |
[character] Column name of column containing the individual IDs. |
otherCriterionMeasure |
[character] Column name of the column containing the measure of the otherCriterion (for instance SNP missingness rate). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
Value
named [list] with i) relatednessFails, a [data.frame] containing the data.frame relatedness after filtering for pairs of individuals in relatednessIID1 and relatednessIID2, that fail the relatedness QC; the data.frame is reordered with the fail individuals in column 1 and their related individuals in column 2 and ii) failIDs, a [data.frame] with the [IID]s (and [FID]s if provided) of the individuals that fail the relatednessTh.
Run PLINK principal component analysis
Description
Run plink –pca to calculate the principal components on merged genotypes of the study and reference dataset.
Usage
run_check_ancestry(
indir,
prefixMergedDataset,
qcdir = indir,
verbose = FALSE,
path2plink = NULL,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files prefixMergedDataset.bim,prefixMergedDataset.fam and prefixMergedDataset.bed. |
prefixMergedDataset |
[character] Prefix of merged study and reference data files, i.e. prefixMergedDataset.bed, prefixMergedDataset.bim, prefixMergedDataset.fam. |
qcdir |
[character] /path/to/directory to save prefixMergedDataset.eigenvec as returned by plink –pca. User needs writing permission to qcdir. Per default qcdir=indir. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Details
Both, run_check_ancestry
and its evaluation by
evaluate_check_ancestry
can simply be invoked by
check_ancestry
.
Examples
indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
prefixMergedDataset <- 'data.HapMapIII'
path2plink <- 'path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
## Not run:
# ancestry check on all individuals in dataset
run <- run_check_ancestry(indir=indir, qcdir=qcdir, prefixMergedDataset,
path2plink=path2plink)
# ancestry check on subset of dataset
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")
run <- run_check_ancestry(indir=indir, qcdir=qcdir, name=name,
remove_individuals=remove_individuals_file, path2plink=path2plink)
## End(Not run)
Run PLINK heterozygosity rate calculation
Description
Run plink –het to calculate heterozygosity rates per individual.
Usage
run_check_heterozygosity(
indir,
name,
qcdir = indir,
verbose = FALSE,
path2plink = NULL,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory to save name.het as returned by plink –het. User needs writing permission to qcdir. Per default qcdir=indir. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Details
All, run_check_heterozygosity
,
run_check_missingness
and their evaluation by
evaluate_check_het_and_miss
can simply be invoked by
check_het_and_miss
.
Examples
indir <- system.file("extdata", package="plinkQC")
name <- 'data'
qcdir <- tempdir()
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
## Not run:
# heterozygosity check on all individuals in dataset
run <- run_check_heterozygosity(indir=indir, qcdir=qcdir, name=name,
path2plink=path2plink)
#' # heterozygosity on subset of dataset
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")
run <- run_check_heterozygosity(indir=indir, qcdir=qcdir, name=name,
remove_individuals=remove_individuals_file,path2plink=path2plink)
## End(Not run)
Run PLINK missingness rate calculation
Description
Run plink –missing to calculate missing genotype rates per individual.
Usage
run_check_missingness(
indir,
name,
qcdir = indir,
verbose = FALSE,
path2plink = NULL,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory to save name.imiss as returned by plink –missing. User needs writing permission to qcdir. Per default qcdir=indir. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Details
All, run_check_heterozygosity
,
run_check_missingness
and their evaluation by
evaluate_check_het_and_miss
can simply be invoked by
check_het_and_miss
.
Examples
indir <- system.file("extdata", package="plinkQC")
name <- 'data'
qcdir <- tempdir()
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
## Not run:
# missingness check on all individuals in dataset
run <- run_check_missingness(indir=indir, qcdir=qcdir, name=name,
path2plink=path2plink)
# missingness on subset of dataset
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")
run <- run_check_missingness(indir=indir, qcdir=qcdir, name=name,
remove_individuals=remove_individuals_file, path2plink=path2plink)
## End(Not run)
Run PLINK IBD estimation
Description
Run LD pruning on dataset with plink –exclude range highldfile –indep-pairwise 50 5 0.2, where highldfile contains regions of high LD as provided by Anderson et (2010) Nature Protocols. Subsequently, plink –genome is run on the LD pruned, maf-filtered data. plink –genome calculates identity by state (IBS) for each pair of individuals based on the average proportion of alleles shared at genotyped SNPs. The degree of recent shared ancestry,i.e. the identity by descent (IBD) can be estimated from the genome-wide IBS. The proportion of IBD between two individuals is returned by –genome as PI_HAT.
Usage
run_check_relatedness(
indir,
name,
qcdir = indir,
highIBDTh = 0.185,
mafThRelatedness = 0.1,
path2plink = NULL,
filter_high_ldregion = TRUE,
high_ldregion_file = NULL,
genomebuild = "hg19",
showPlinkOutput = TRUE,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
verbose = FALSE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory to save name.genome as returned by plink –genome. User needs writing permission to qcdir. Per default qcdir=indir. |
highIBDTh |
[double] Threshold for acceptable proportion of IBD between pair of individuals; only pairwise relationship estimates larger than this threshold will be recorded. |
mafThRelatedness |
[double] Threshold of minor allele frequency filter for selecting variants for IBD estimation. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
filter_high_ldregion |
[logical] Should high LD regions be filtered
before IBD estimation; carried out per default with high LD regions for
hg19 provided as default via |
high_ldregion_file |
[character] Path to file with high LD regions used
for filtering before IBD estimation if |
genomebuild |
[character] Name of the genome build of the PLINK file annotations, ie mappings in the name.bim file. Will be used to remove high-LD regions based on the coordinates of the respective build. Options are hg18, hg19 and hg38. See @details. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
Details
Both run_check_relatedness
and its evaluation via
evaluate_check_relatedness
can simply be invoked by
check_relatedness
.
The IBD estimation is conducted on LD pruned data and in a first step, high LD regions are excluded. The regions were derived from the high-LD-regions file provided by Anderson et (2010) Nature Protocols. These regions are in NCBI36 (hg18) coordinates and were lifted to GRCh37 (hg19) and GRC38 (hg38) coordinates using the liftOver tool available here: https://genome.ucsc.edu/cgi-bin/hgLiftOver. The 'Minimum ratio of bases that must remap' which was set to 0.5 and the 'Allow multiple output regions' box ticked; for all other parameters, the default options were selected. LiftOver files were generated on July 9,2019. The commands for formatting the files are provided in system.file("extdata", 'liftOver.cmd', package="plinkQC").
Examples
indir <- system.file("extdata", package="plinkQC")
name <- 'data'
qcdir <- tempdir()
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
## Not run:
# Relatedness estimation based in all markers in dataset
run <- run_check_relatedness(indir=indir, qcdir=qcdir, name=name,
path2plink=path2plink)
# relatedness estimation on subset of dataset
keep_individuals_file <- system.file("extdata", "keep_individuals",
package="plinkQC")
run <- run_check_relatedness(indir=indir, qcdir=qcdir, name=name,
keep_individuals=keep_individuals_file, path2plink=path2plink)
## End(Not run)
Run PLINK sexcheck
Description
Run plink –sexcheck to calculate the heterozygosity rate across X-chromosomal variants.
Usage
run_check_sex(
indir,
name,
qcdir = indir,
verbose = FALSE,
path2plink = NULL,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory to save name.sexcheck as returned by plink –check-sex. User needs writing permission to qcdir. Per default qcdir=indir. |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Details
Both run_check_sex
and its evaluation
evaluate_check_sex
can simply be invoked by
check_sex
.
Examples
indir <- system.file("extdata", package="plinkQC")
name <- 'data'
qcdir <- tempdir()
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
## Not run:
# simple sexcheck on all individuals in dataset
run <- run_check_sex(indir=indir, qcdir=qcdir, name=name)
# sexcheck on subset of dataset
keep_individuals_file <- system.file("extdata", "keep_individuals",
package="plinkQC")
run <- run_check_sex(indir=indir, qcdir=qcdir, name=name,
keep_individuals=keep_individuals_file, path2plink=path2plink)
## End(Not run)
Test lists for different properties of numerics
Description
Test all elements of a list if they are numeric, positive numbers, integers or proportions (range 0-1).
Usage
testNumerics(numbers, positives = NULL, integers = NULL, proportions = NULL)
Arguments
numbers |
[list] whose elements are tested for being numeric. |
positives |
[list] whose elements are tested for being positive numbers. |
integers |
[list] whose elements are tested for being integers. |
proportions |
[list] whose elements are tested for being proportions. between 0 and 1. |