The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
The sparse marginal epistasis (SME) test performs a genome-wide search for SNPs involved in genetic interactions while conditioning on information derived from functional genomic data.
By examining one SNP at a time, SME fits a linear mixed model to test for marginal epistasis. It explicitly models the combined additive effects from all SNPs a marginal epistatic effect that represents pairwise interactions involving a test SNP.
The key to the SME formulation is that the interaction between the test SNP and other SNPs may be masked depending on additional information. Masking interactions that do not contribute to trait variance both maximizes the power of the inference as well as realizes the computational efficiency needed to analyze human Biobank scale data.
This pages presents a toy example to show case what insights the implementation of the test provides.
Figure 1. SME performs a genome-wide search for SNPs involved in genetic interactions while conditioning on information derived from functional genomic data. SME has improved power to detect marginal epistasis and runs 10x to 90x faster than state-of-the-art methods.
SME is implemented in R and requires your genetic data to be in PLINK format, which consists of three files:
.bim
: Contains SNP information.bed
: Contains genetic data.fam
: Contains sample informationNote: sme()
cannot handle missing
genotypes. Prior to running sme()
, use PLINK to remove
variants with missing genotypes or impute them.
Additionally, your phenotype (trait) data should be in a separate file following PLINK’s phenotype format.
The sme()
function includes parameters that let you
control memory usage and computational resources. For detailed guidance
on optimizing these settings for your system, please see our tutorial How
To Optimize the Memory Requirements of SME.
When selecting which SNPs to analyze, you must provide their
positions as 1-based indices. These indices correspond
to the row numbers in your .bim
file, where the first SNP
is index 1, the second is index 2, and so on.
For complete details about all function parameters, please refer to
the documentation of the sme()
function.
# File inputs
plink_file <- "path/to/plink/file"
pheno_file <- "path/to/pheno/file"
mask_file <- "path/to/mask/file"
# Parameter inputs
chun_ksize <- 10
n_randvecs <- 10
n_blocks <- 10
n_threads <- 5
# 1-based Indices of SNPs to be analyzed
n_snps <- 100
snp_indices <- 1:n_snps
sme_result <- sme(
plink_file,
pheno_file,
mask_file,
snp_indices,
chunk_size,
n_randvecs,
n_blocks,
n_threads
)
This analysis uses simulated data for demonstration purposes. We simulated synthetic phenotypes from 5000 synthetic genotypes with 6000 SNPs. If you would like to learn how to simulate data, please refer to our tutorial How To Simulate Traits.
When you call the sme()
function, it returns a list
containing multiple elements. The most important one is called
summary
, which contains the main analysis results. These
results are formatted as tidy data, making them compatible with popular
R packages like ggplot2
and dplyr
for further
analysis and visualization.
We use Manhattan plots to visualize genome-wide analyses because they effectively highlight strong associations between genetic variants and traits. In this case, the Manhattan plot specifically shows statistical epistasis (interactions between genes). For reference, we’ve marked the true causal SNPs (Single Nucleotide Polymorphisms) in green on the plot - these are the genetic variants we included in our simulation as having real effects.
SME estimates how much of the total trait variation can be explained by genetic interactions. In this simulation, we set the Phenotypic Variance Explained (PVE) to 5% for each SNP involved in epistatic interactions.
The plot below shows how well our method recovered these effects. It displays two distributions:
The dashed line marks the true 5% PVE level we used in the simulation, allowing you to see how accurately the method estimated the actual effect sizes.
sme_result$summary %>%
ggplot(aes(x = true_gxg_snp, y = pve, fill = true_gxg_snp)) +
geom_boxplot() +
geom_hline(yintercept = 0.05, color = "grey40", linetype = "dashed") +
annotate("text", x = 0.8, y = 0.055,
label = "True per SNP epistatic PVE", color = "black") +
xlab("Epistatic SNP") +
ylab("Phenotypic Variance Explained") +
theme(legend.position = "none")
The SME method uses a linear mixed model to separate different sources of trait variation. One key component it estimates is narrow sense heritability (\(h^2\)), which measures how much trait variation can be explained by additive genetic effects.
The plot below breaks down the estimated sources of trait variation:
In our simulation, we set the true narrow sense heritability to 30%, shown as a dashed line in the plot. This reference line helps you evaluate how accurately SME estimated the genetic components of trait variation.
sme_result$vc_estimate %>%
ggplot(aes(x = component, y = vc_estimate, fill = component)) +
geom_boxplot() +
geom_hline(yintercept = 0.3, color = "grey40", linetype = "dashed") +
annotate("text", x = 0.7, y = 0.33,
label = expression("True " * h^2), color = "black") +
xlab("Component") +
ylab("Variance Component Estimate") +
theme(legend.position = "none")
The estimate of the narrow-sense heritability \(h^2\) is much less variable because it is always informed by the same genetic relatedness matrix. In this small data example it overestimates the heritability but it is unbiased in general.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.2
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Berlin
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 dplyr_1.1.4 smer_0.0.1
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 jsonlite_1.8.9 compiler_4.4.2
#> [4] tidyselect_1.2.1 Rcpp_1.0.14 FMStable_0.1-4
#> [7] parallel_4.4.2 tidyr_1.3.1 jquerylib_0.1.4
#> [10] scales_1.3.0 yaml_2.3.10 fastmap_1.2.0
#> [13] harmonicmeanp_3.0.1 R6_2.5.1 labeling_0.4.3
#> [16] generics_0.1.3 knitr_1.49 genio_1.1.2
#> [19] iterators_1.0.14 backports_1.5.0 checkmate_2.3.2
#> [22] tibble_3.2.1 munsell_0.5.1 bslib_0.8.0
#> [25] pillar_1.10.1 rlang_1.1.4 cachem_1.1.0
#> [28] xfun_0.50 sass_0.4.9 cli_3.6.3
#> [31] withr_3.0.2 magrittr_2.0.3 grid_4.4.2
#> [34] digest_0.6.37 mvMAPIT_2.0.3 foreach_1.5.2
#> [37] mvtnorm_1.3-3 lifecycle_1.0.4 CompQuadForm_1.4.3
#> [40] vctrs_0.6.5 evaluate_1.0.3 glue_1.8.0
#> [43] farver_2.1.2 codetools_0.2-20 colorspace_2.1-1
#> [46] purrr_1.0.2 rmarkdown_2.29 tools_4.4.2
#> [49] pkgconfig_2.0.3 htmltools_0.5.8.1
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.