The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
We apply SME to hematology traits of white British individuals from the UK Biobank. After quality control, the remaining data are 349,411 individuals and 543,813 SNPs of common variants. We select the traits mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration, mean corpuscular volume (MCV), and hematocrit (HCT). As external sparse data source, we leverage DNase I-hypersensitive sites (DHSs) data measured over 12 days of ex-vivo erythroid differentiation (Georgolopoulos et al. 2024). DHS are enriched in transcriptional activity and used to identify regulatory DNA.
The first three traits, MCH, MCHC, and MCV are traits of the red blood cells (RBC). Previous GWAS studies found that genes associated with these traits are implicated in erythroid differentiation. Therefore, we expect that genomic data that indicates regulatory regions gathered during erythropoiesis will be informative for these traits. HCT measures the percentage of red blood cells in the blood. The maturation of erythroid progenitor cells is regulated by an oxygen-sensing mechanism. We hypothesise that HCT, is not informed by functional data of erythropoiesis.
The external data sources used in this study represented genomic
intervals of DHS regions and LD blocks. In the following we mock this
data to illustrate how to create a mask file for sme()
. See
How
To Create a Mask File for more details.
bim_data <- data.frame(
chromosome = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
variant_id = c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7", "rs8", "rs9"),
cm_position = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
bp_position = c(10, 20, 30, 40, 50, 60, 70, 80, 90),
allele1 = c("A", "A", "A", "G", "C", "C", "T", "T", "A"),
allele1 = c("G", "G", "G", "A", "T", "T", "A", "A", "G")
)
bim_data$index <- 1:nrow(bim_data)
# DHS intervals
hg19_dhs_regions <- data.frame(
chromosome = c(1, 2, 3),
start = c(5, 45, 85),
stop = c(15, 55, 95)
)
# LD block intervals
hg19_ld_blocks <- data.frame(
chromosome = c(1, 1, 2, 2, 3, 3, 3),
start = c(5, 25, 35, 45, 65, 75, 85),
stop = c(25, 35, 45, 65, 75, 85, 95)
)
We make use of the package GenomicRanges
to efficiently
map the PLINK
data to the intervals of the DHS and LD
data.
# Convert .bim to GRanges object
bim_gr <- GRanges(
seqnames = paste0("chr", bim_data$chromosome),
ranges = IRanges(start = bim_data$bp_position, end = bim_data$bp_position),
variant_id = bim_data$variant_id,
genome = "hg19"
)
# Convert DHS to GRanges object
dhs_gr <- GRanges(
seqnames = paste0("chr", hg19_dhs_regions$chromosome),
ranges = IRanges(start = hg19_dhs_regions$start, end = hg19_dhs_regions$stop),
genome = "hg19"
)
# Find overlaps of BIM variants and DHS intervals
overlaps <- findOverlaps(bim_gr, dhs_gr, maxgap = 0)
# Extract overlapping variants
dhs_data <- bim_data[queryHits(overlaps), ]
dhs_data <- dhs_data[!duplicated(dhs_data$index), ]
Of the 543,813 SNPs in our data, 4,932 are located in the DHS regions. The DHS regions in this data are distributed along the whole genome. To test for marginal epistasis with SME we consider the variants in the DHS regions important.
Next we map the PLINK
data to the LD blocks.
# Convert to GRanges object
ld_gr <- GRanges(
seqnames = paste0("chr", hg19_ld_blocks$chromosome),
ranges = IRanges(start = hg19_ld_blocks$start, end = hg19_ld_blocks$stop),
genome = "hg19"
)
# Find LD block of bim variants
ld_overlaps <- findOverlaps(query = bim_gr, subject = ld_gr)
With these objects, we can create the mask file. With larger data, we
recommend splitting the PLINK
variants that are analyzed
into batches, create one mask file per batch, and submit one job per
batch on a High Peformance Cluster.
output_file <- tempfile()
gxg_group <- "gxg"
ld_group <- "ld"
gxg_variants <- dhs_data$index - 1 # 0-base index for C++
create_hdf5_file(output_file)
for (j in bim_data$index - 1) { # 0-base index for C++
# Write DHS mask
gxg_ds <- sprintf("%s/%d", gxg_group, j)
write_hdf5_dataset(file_name = output_file,
dataset_name = gxg_ds,
gxg_variants)
# Find LD block of focal SNP
focal_gr <- ld_gr[subjectHits(ld_overlaps[j,])]
# Find variants in LD block of focal SNP
focal_ld <- findOverlaps(query = bim_gr, subject = focal_gr)
ld_data <- bim_data[queryHits(focal_ld),]
ld_variants <- ld_data$index - 1 # 0-base index for C++
# Write LD mask
ld_ds <- sprintf("%s/%d", ld_group, j)
write_hdf5_dataset(file_name = output_file,
dataset_name = ld_ds,
ld_variants)
}
dhs_indices <- read_hdf5_dataset(file_name = output_file, dataset_name = gxg_ds)
print(sprintf("DHS indices: %s", paste(dhs_indices, collapse = ", ")))
#> [1] "DHS indices: 0, 4, 8"
With this mask file we can run SME.
sme_result <- sme(
plink_file = "/path/to/plink/data",
pheno_file = "/path/to/pheno/data",
mask_file = "/path/to/mask/file",
gxg_indices = c(1, 2, 3),
chunk_size = 250,
n_randvecs = 10,
n_blocks = 200,
n_threads = 6
)
The genome-wide association test for marginal epistasis in the red blood cell traits MCH and MCV finds genome-wide significant statistical epistasis (\(P < \num{5e-8}\)) on chromosome 6 (Fig. 1). Importantly, most of the SNPs or the genes which they map to have been previously discovered for non-additive gene action related to erythropoiesis and RBC traits.
Figure 1. Manhattan plot of the SME analysis. The dashed blue line is the significance threshold after Bonferroni correction.
The strongest association in the trait MCH, SNP rs4711092 (\(P = \num{1.41e-11}\), PVE 0.7%), maps to the gene secretagogin (). The gene regulates exocytosis by interacting with two soluble NSF adaptor proteins ( and ) and is critical for cell growth in some tissues. A total of five SNPs which SME significantly associates with MCH (strongest association with rs9366624 \(P = \num{1.8e-9}\), PVE 1.1%) are in the gene capping protein regulator and myosin 1 linker 1 (). The gene is known to interact with and regulate caping protein (). plays a role via protein-protein interaction in regulating erythrpoiesis. Specifically, proteins regulate actin dynamics by regulating the activity of . Erythropoiesis leads to modifications in the expression of membrane and cytoskeletal proteins, whose interactions impact cell structure and function. Both genes and have previously been associated with hemoglobin concentration. The strongest association in the trait MCV, SNP rs9276 (\(P = \num{9.09e-10}\), PVE 0.24%) maps to the gene of the major histocompatibility complex. With the SNP rs9366624 (\(P = \num{1.86e-8}\), PVE 0.8%), also the gene is significantly associated with the trait for marginal epistasis. The complete list of significant associations produced by SME is reported in Tab. 1.
Trait | ID | Coordinates | \(P\)-Value | PVE | Gene |
---|---|---|---|---|---|
MCH | rs4711092 | chr6:25684405 | \(\num{1.41e-11}\) | 0.007 (0.0009) | SCGN |
MCH | rs9366624 | chr6:25439492 | \(\num{1.8e-9}\) | 0.011 (0.002) | CARMIL1 |
MCH | rs9461167 | chr6:25418571 | \(\num{2.34e-9}\) | 0.007 (0.001) | CARMIL1 |
MCH | rs9379764 | chr6:25414023 | \(\num{5.53e-9}\) | 0.012 (0.002) | CARMIL1 |
MCH | rs441460 | chr6:25548288 | \(\num{1.2e-8}\) | 0.008 (0.001) | CARMIL1 |
MCH | rs198834 | chr6:26114372 | \(\num{2.77e-8}\) | 0.008 (0.001) | H2BC4 |
MCH | rs13203202 | chr6:25582771 | \(\num{3.17e-8}\) | 0.012 (0.002) | CARMIL1 |
MCV | rs9276 | chr6:33053577 | \(\num{9.09e-10}\) | 0.0024 (0.0004) | HLA-DPB1 |
MCV | rs9366624 | chr6:25439492 | \(\num{1.86e-8}\) | 0.008 (0.001) | CARMIL1 |
Table 1. Significant trait associations for marginal epistasis.
Fitting the linear mixed model of SME also produces narrow-sense heritability estimates equivalent to RHE regression. The heritability estimates of SME for the four traits in this study are similar to heritability estimates found in the literature(Tab. 2).
Trait | \(h^2\) |
---|---|
HCT | 0.28 (0.01) |
MCH | 0.56 (0.03) |
MCHC | 0.14 (0.01) |
MCV | 0.52 (0.03) |
Table 2. Narrow-sense heritability (\(h^2\)) estimates in the SME analysis.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.2
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Berlin
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1 IRanges_2.38.1
#> [4] S4Vectors_0.42.1 BiocGenerics_0.50.0 ggplot2_3.5.1
#> [7] dplyr_1.1.4 smer_0.0.1
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 generics_0.1.3 tidyr_1.3.1
#> [4] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.3
#> [7] grid_4.4.2 iterators_1.0.14 CompQuadForm_1.4.3
#> [10] mvtnorm_1.3-3 fastmap_1.2.0 foreach_1.5.2
#> [13] genio_1.1.2 jsonlite_1.8.9 backports_1.5.0
#> [16] httr_1.4.7 purrr_1.0.2 UCSC.utils_1.0.0
#> [19] scales_1.3.0 codetools_0.2-20 jquerylib_0.1.4
#> [22] cli_3.6.3 rlang_1.1.4 XVector_0.44.0
#> [25] munsell_0.5.1 withr_3.0.2 cachem_1.1.0
#> [28] yaml_2.3.10 FMStable_0.1-4 tools_4.4.2
#> [31] parallel_4.4.2 checkmate_2.3.2 colorspace_2.1-1
#> [34] GenomeInfoDbData_1.2.12 vctrs_0.6.5 R6_2.5.1
#> [37] lifecycle_1.0.4 zlibbioc_1.50.0 mvMAPIT_2.0.3
#> [40] pkgconfig_2.0.3 pillar_1.10.1 bslib_0.8.0
#> [43] gtable_0.3.6 glue_1.8.0 Rcpp_1.0.14
#> [46] harmonicmeanp_3.0.1 xfun_0.50 tibble_3.2.1
#> [49] tidyselect_1.2.1 knitr_1.49 farver_2.1.2
#> [52] htmltools_0.5.8.1 rmarkdown_2.29 labeling_0.4.3
#> [55] compiler_4.4.2
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.