The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Conditioning Epistasis Search on Open Chromatin

DNAse I hypersensitive sites of erythroid differentiation reveal statistical epistasis in human hematology traits

We apply SME to hematology traits of white British individuals from the UK Biobank. After quality control, the remaining data are 349,411 individuals and 543,813 SNPs of common variants. We select the traits mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration, mean corpuscular volume (MCV), and hematocrit (HCT). As external sparse data source, we leverage DNase I-hypersensitive sites (DHSs) data measured over 12 days of ex-vivo erythroid differentiation (Georgolopoulos et al. 2024). DHS are enriched in transcriptional activity and used to identify regulatory DNA.

The first three traits, MCH, MCHC, and MCV are traits of the red blood cells (RBC). Previous GWAS studies found that genes associated with these traits are implicated in erythroid differentiation. Therefore, we expect that genomic data that indicates regulatory regions gathered during erythropoiesis will be informative for these traits. HCT measures the percentage of red blood cells in the blood. The maturation of erythroid progenitor cells is regulated by an oxygen-sensing mechanism. We hypothesise that HCT, is not informed by functional data of erythropoiesis.

Mask File Preparation

The external data sources used in this study represented genomic intervals of DHS regions and LD blocks. In the following we mock this data to illustrate how to create a mask file for sme(). See How To Create a Mask File for more details.

bim_data <- data.frame(
  chromosome = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  variant_id = c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7", "rs8", "rs9"),
  cm_position = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
  bp_position = c(10, 20, 30, 40, 50, 60, 70, 80, 90),
  allele1 = c("A", "A", "A", "G", "C", "C", "T", "T", "A"),
  allele1 = c("G", "G", "G", "A", "T", "T", "A", "A", "G")
)
bim_data$index <- 1:nrow(bim_data)

# DHS intervals
hg19_dhs_regions <- data.frame(
  chromosome = c(1, 2, 3),
  start = c(5, 45, 85),
  stop = c(15, 55, 95)
)

# LD block intervals
hg19_ld_blocks <- data.frame(
  chromosome = c(1, 1, 2, 2, 3, 3, 3),
  start = c(5, 25, 35, 45, 65, 75, 85),
  stop = c(25, 35, 45, 65, 75, 85, 95)
)

We make use of the package GenomicRanges to efficiently map the PLINK data to the intervals of the DHS and LD data.

# Convert .bim to GRanges object
bim_gr <- GRanges(
  seqnames = paste0("chr", bim_data$chromosome),
  ranges = IRanges(start = bim_data$bp_position, end = bim_data$bp_position),
  variant_id = bim_data$variant_id,
  genome = "hg19"
)

# Convert DHS to GRanges object
dhs_gr <- GRanges(
  seqnames = paste0("chr", hg19_dhs_regions$chromosome),
  ranges = IRanges(start = hg19_dhs_regions$start, end = hg19_dhs_regions$stop),
  genome = "hg19"
)

# Find overlaps of BIM variants and DHS intervals
overlaps <- findOverlaps(bim_gr, dhs_gr, maxgap = 0)

# Extract overlapping variants
dhs_data <- bim_data[queryHits(overlaps), ]
dhs_data <- dhs_data[!duplicated(dhs_data$index), ]

Of the 543,813 SNPs in our data, 4,932 are located in the DHS regions. The DHS regions in this data are distributed along the whole genome. To test for marginal epistasis with SME we consider the variants in the DHS regions important.

Next we map the PLINK data to the LD blocks.

# Convert to GRanges object
ld_gr <- GRanges(
  seqnames = paste0("chr", hg19_ld_blocks$chromosome),
  ranges = IRanges(start = hg19_ld_blocks$start, end = hg19_ld_blocks$stop),
  genome = "hg19"
)

# Find LD block of bim variants
ld_overlaps <- findOverlaps(query = bim_gr, subject = ld_gr)

With these objects, we can create the mask file. With larger data, we recommend splitting the PLINK variants that are analyzed into batches, create one mask file per batch, and submit one job per batch on a High Peformance Cluster.

output_file <- tempfile()
gxg_group <- "gxg"
ld_group <- "ld"

gxg_variants <- dhs_data$index - 1 # 0-base index for C++

create_hdf5_file(output_file)

for (j in bim_data$index - 1) { # 0-base index for C++
  # Write DHS mask
  gxg_ds <- sprintf("%s/%d", gxg_group, j)
  write_hdf5_dataset(file_name = output_file,
                     dataset_name = gxg_ds,
                     gxg_variants)

  # Find LD block of focal SNP
  focal_gr <- ld_gr[subjectHits(ld_overlaps[j,])]

  # Find variants in LD block of focal SNP
  focal_ld <- findOverlaps(query = bim_gr, subject = focal_gr)
  ld_data <- bim_data[queryHits(focal_ld),]
  ld_variants <- ld_data$index - 1 # 0-base index for C++

  # Write LD mask
  ld_ds <- sprintf("%s/%d", ld_group, j)
  write_hdf5_dataset(file_name = output_file,
                     dataset_name = ld_ds,
                     ld_variants)
}

dhs_indices <- read_hdf5_dataset(file_name = output_file, dataset_name = gxg_ds)
print(sprintf("DHS indices: %s", paste(dhs_indices, collapse = ", ")))
#> [1] "DHS indices: 0, 4, 8"

With this mask file we can run SME.

sme_result <- sme(
  plink_file = "/path/to/plink/data",
  pheno_file = "/path/to/pheno/data",
  mask_file = "/path/to/mask/file",
  gxg_indices = c(1, 2, 3),
  chunk_size = 250,
  n_randvecs = 10,
  n_blocks = 200,
  n_threads = 6
)

The genome-wide association test for marginal epistasis in the red blood cell traits MCH and MCV finds genome-wide significant statistical epistasis (\(P < \num{5e-8}\)) on chromosome 6 (Fig. 1). Importantly, most of the SNPs or the genes which they map to have been previously discovered for non-additive gene action related to erythropoiesis and RBC traits.

Figure 1. Manhattan plot of the SME analysis. The dashed blue line is the significance threshold after Bonferroni correction.

The strongest association in the trait MCH, SNP rs4711092 (\(P = \num{1.41e-11}\), PVE 0.7%), maps to the gene secretagogin (). The gene regulates exocytosis by interacting with two soluble NSF adaptor proteins ( and ) and is critical for cell growth in some tissues. A total of five SNPs which SME significantly associates with MCH (strongest association with rs9366624 \(P = \num{1.8e-9}\), PVE 1.1%) are in the gene capping protein regulator and myosin 1 linker 1 (). The gene is known to interact with and regulate caping protein (). plays a role via protein-protein interaction in regulating erythrpoiesis. Specifically, proteins regulate actin dynamics by regulating the activity of . Erythropoiesis leads to modifications in the expression of membrane and cytoskeletal proteins, whose interactions impact cell structure and function. Both genes and have previously been associated with hemoglobin concentration. The strongest association in the trait MCV, SNP rs9276 (\(P = \num{9.09e-10}\), PVE 0.24%) maps to the gene of the major histocompatibility complex. With the SNP rs9366624 (\(P = \num{1.86e-8}\), PVE 0.8%), also the gene is significantly associated with the trait for marginal epistasis. The complete list of significant associations produced by SME is reported in Tab. 1.

Trait	ID	Coordinates	\(P\)-Value	PVE	Gene
MCH	rs4711092	chr6:25684405	\(\num{1.41e-11}\)	0.007 (0.0009)	SCGN
MCH	rs9366624	chr6:25439492	\(\num{1.8e-9}\)	0.011 (0.002)	CARMIL1
MCH	rs9461167	chr6:25418571	\(\num{2.34e-9}\)	0.007 (0.001)	CARMIL1
MCH	rs9379764	chr6:25414023	\(\num{5.53e-9}\)	0.012 (0.002)	CARMIL1
MCH	rs441460	chr6:25548288	\(\num{1.2e-8}\)	0.008 (0.001)	CARMIL1
MCH	rs198834	chr6:26114372	\(\num{2.77e-8}\)	0.008 (0.001)	H2BC4
MCH	rs13203202	chr6:25582771	\(\num{3.17e-8}\)	0.012 (0.002)	CARMIL1
MCV	rs9276	chr6:33053577	\(\num{9.09e-10}\)	0.0024 (0.0004)	HLA-DPB1
MCV	rs9366624	chr6:25439492	\(\num{1.86e-8}\)	0.008 (0.001)	CARMIL1

Table 1. Significant trait associations for marginal epistasis.

Fitting the linear mixed model of SME also produces narrow-sense heritability estimates equivalent to RHE regression. The heritability estimates of SME for the four traits in this study are similar to heritability estimates found in the literature(Tab. 2).

Trait	\(h^2\)
HCT	0.28 (0.01)
MCH	0.56 (0.03)
MCHC	0.14 (0.01)
MCV	0.52 (0.03)

Table 2. Narrow-sense heritability (\(h^2\)) estimates in the SME analysis.

References

Stamp J, Smith Pattillo S, Weinreich D, Crawford L (2025). Sparse modeling of interactions enables fast detection of genome-wide epistasis in biobank-scale studies. biorxiv, https://doi.org/10.1101/2025.01.11.632557
Georgolopoulos, G. GEO Data Set: Discrete regulatory modules instruct hematopoietic lineage commitment and differentiation https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE182816 (2024).

SessionInfo

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Berlin
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1  IRanges_2.38.1      
#> [4] S4Vectors_0.42.1     BiocGenerics_0.50.0  ggplot2_3.5.1       
#> [7] dplyr_1.1.4          smer_0.0.1          
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9              generics_0.1.3          tidyr_1.3.1            
#>  [4] digest_0.6.37           magrittr_2.0.3          evaluate_1.0.3         
#>  [7] grid_4.4.2              iterators_1.0.14        CompQuadForm_1.4.3     
#> [10] mvtnorm_1.3-3           fastmap_1.2.0           foreach_1.5.2          
#> [13] genio_1.1.2             jsonlite_1.8.9          backports_1.5.0        
#> [16] httr_1.4.7              purrr_1.0.2             UCSC.utils_1.0.0       
#> [19] scales_1.3.0            codetools_0.2-20        jquerylib_0.1.4        
#> [22] cli_3.6.3               rlang_1.1.4             XVector_0.44.0         
#> [25] munsell_0.5.1           withr_3.0.2             cachem_1.1.0           
#> [28] yaml_2.3.10             FMStable_0.1-4          tools_4.4.2            
#> [31] parallel_4.4.2          checkmate_2.3.2         colorspace_2.1-1       
#> [34] GenomeInfoDbData_1.2.12 vctrs_0.6.5             R6_2.5.1               
#> [37] lifecycle_1.0.4         zlibbioc_1.50.0         mvMAPIT_2.0.3          
#> [40] pkgconfig_2.0.3         pillar_1.10.1           bslib_0.8.0            
#> [43] gtable_0.3.6            glue_1.8.0              Rcpp_1.0.14            
#> [46] harmonicmeanp_3.0.1     xfun_0.50               tibble_3.2.1           
#> [49] tidyselect_1.2.1        knitr_1.49              farver_2.1.2           
#> [52] htmltools_0.5.8.1       rmarkdown_2.29          labeling_0.4.3         
#> [55] compiler_4.4.2

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.