Loading Data
The current version of cape recognizes data in multiple formats:
R/qtl, R/qtl2, and PLINK.
Data for the demonstrations below can be found in the demo folder
CAPE github site: https://github.com/TheJacksonLaboratory/cape
Use the green “Code” button in the upper right of the page to
download the repository and put it where you keep code or projects or
other things of this nature. The top level of the folder has a .here
file in it from the here package. This
will anchor any R session opened in this folder to the top level and
path names can be constructed using just folder names. For example, to
create a path that points to the parameter file in the demo_PLINK
folder, use the following command:
param_file_path <- here("demo", "demo_PLINK", "0_plink.parameters_0.yml")
Open R and set the working directory to cape_master after downloading
the code linked above. This will allow the relative paths constructed
with here()
in the demo code to find the data and parameter
files it needs.
Also make sure the package here is installed and loaded.
install.packages("here")
library("here")
Loading data formatted for PLINK
Completely new to this version of cape is the ability to load files
formatted for PLINK (Purcell et al.
(2007)). This enables broader use of cape by researchers who are
interested in studying genetic interactions in human cohorts.
We have provided example data with the cape demo_plink. Please see
the PLINK
website for more information about PLINK and formatting data for
PLINK.
For cape, we require a ped file, a map file, and a phenotype file.
PLINK standard format includes only one trait, and cape requires at
least two traits. We therefore need the phenotype file in the pheno.txt
format as described here.
To generate the data for the demo included in cape, we downloaded
example PLINK hapmap
data.
To reduce the size of the data set for demonstration purposes, we
sampled 1000 the SNPs in the data set using PLINK:
./plink –file hapmap1 –make-bed –thin-count 1000 –recode –out
test
We generated two random, normally distributed phenotypes for the
phenotype file. These are not particularly interesting, but they do
demonstrate what the files should look like.
The following code shows how to read in PLINK data, use the function
plink2cape
to read in the map, ped, and phenotype
files.
This code is also included in demo_PLINK.R, which is included with
cape.
data_path <- here("demo", "demo_PLINK", "data")
ped <- file.path(data_path, "test.ped")
map <- file.path(data_path, "test.map")
pheno <- file.path(data_path, "test.pheno")
out <- file.path(data_path, "test.csv")
param_file <- here("demo", "demo_PLINK", "0_plink.parameters_0.yml")
cross_obj <- plink2cape(ped, map, pheno, out = "out.csv")
data_obj <- cross_obj$data_obj
geno_obj <- cross_obj$geno_obj$geno
Getting started with an example
In this vignette, we will reanalyze a data set described in Reifsnyder (2000). This experiment was performed
to identify quantitative trait loci (QTL) for obesity and other risk
factors of type II diabetes in a reciprocal back-cross of non-obese
non-diabetic NON/Lt mice and the diabetes-prone, New Zealand obese
(NZO/HILt) mice. The study found multiple main-effect QTL influencing
phenotypes associated with diabetes and obesity as well as multiple
epistatic interactions. In addition, maternal environment (i.e. whether
the mother was obese) was found to interact with several markers and
epistatic pairs to influence the risk of obesity and diabetes of the
offspring. The complex nature of diabetes and obesity, along with their
complex and polygenic inheritance patterns, make this data set ideal for
an analysis of epistasis and pleiotropy.
Included in this dataset are 204 male mice genotyped at 85 MIT
markers across the genome. The phenotypes included are the body weight
(g), insulin levels (ng/mL), and the log of plasma glucose levels
(mg/dL), all measured at age 24 weeks. In addition, there is a variable
called ``pgm’’ indicating whether the mother of each mouse was normal
weight (0) or obese (1).
These data can be accessed through the demo_qtl demonstration code,
which was loaded above.
Data visualization
Before proceeding with an analysis we recommend exploring the data
visually. cape provides a few basic functions for looking at
distributions of traits and the correlations between traits.
The figure below shows distributions of three traits: body weight at
24 weeks (BW_24), serum insulin levels (INS_24), and the log of serum
glucose levels (log_GLU_24). We have selected these traits out of all
traits included in the data set by using the pheno_which
argument. Leaving this argument as NULL includes all traits in the
plot.
hist_pheno(obesity_cross, pheno_which = c("BW_24", "INS_24", "log_GLU_24"))
Body weight looks relatively normally distributed, but glucose and
insulin have obviously non-normal distributions. We can see this in a
different way by looking at the Qnorm plots for each trait using
`qnorm_pheno.’
qnorm_pheno(obesity_cross, pheno_which = c("BW_24", "INS_24", "log_GLU_24"))
In general we recommend mean centering and normalizing all traits
before proceeding with the analysis. Trait normalization can be achieved
through log transformation, quantile normalization, or another method
before the analysis. The function norm_pheno
uses quantile
normalization to fit the phenotypes to a normal distribution. Briefly,
this process ranks the trait values and divides by n-1. It then returns
the quantiles of the normal distribution using qnorm.
Mean
centering subtracts the mean value from each trait, and standardizing
divides by the standard deviation.
obesity_cross <- norm_pheno(obesity_cross, mean_center = TRUE)
Now when we plot the distributions compared to a theoretical normal
distribution, we see that our traits are much closer to normally
distributed. Insulin still has a ceiling effect, which cannot be removed
by normalization because rank cannot be determined among equal
values.
qnorm_pheno(obesity_cross, pheno_which = c("BW_24", "INS_24", "log_GLU_24"))
A Note on Trait Selection
Cape relies on the selection of two or more traits that have common
genetic factors but are not identical across all individuals. One
indirect way at getting at this is through trait correlation. Traits
that are modestly correlated may have both common and divergent
contributing genetic factors. This is in contrast to traits that are
perfectly correlated and likely have only common genetic influences, and
to traits that are completely uncorrelated and likely have divergent
genetic influences.
We can look at the correlation of the traits in our data set using
plot_pheno_cor
The figure below shows the correlations
between all pairs of traits. Here, we have also colored the points by
the covariate “pgm” to look for some initial trends. You can see that
mice with obese mothers tended to have higher body weight, insulin
levels, and glucose levels.
The lower triangle of panels in the figure below shows the point
clouds of each pair of traits plotted against each other. The diagonal
shows the distribution of each individual trait, and the upper triangle
gives numeric information about pairwise correlations. If
color_by
is not specified, only the overall pearson R
values are shown for each trait pair. If color_by
is
specified, we show the overall correlation as well as correlations for
each group in color_by.
plot_pheno_cor(obesity_cross, pheno_which = c("BW_24", "INS_24", "log_GLU_24"),
color_by = "pgm", group_labels = c("Non-obese", "Obese"))
The traits in this data set are all modestly correlated, which is
ideal for cape. In addition, we have selected traits from a range of
physiological levels. Body weight is a high-level physiological trait,
whereas glucose and insulin levels are endophenotypes measured at the
molecular level.
Cape measures interactions between pairs of genes across all traits
with the assumption that different genetic interactions identified for a
single gene pair in the context of different phenotypes represent
multiple manifestations of a single underlying gene network. By
measuring the interactions between genetic variants in different
contexts we can gain a clearer picture of the network underlying
statistical epistasis (Carter et al.
2012).
Specifying cape parameters
Before starting a cape run, we need to specify all the parameters for
the run in a parameter file. We use a .yml file for these parameters. A
.yml file holds information easily readable to both humans and
computers. We have multiple examples of cape parameter files in the demo
folder associated with the cape package. It is probably easiest to start
with one of these and modify it to fit your data. The following sections
describe each cape parameter in the yml file.
General parameters
The section of general parameters is where we specify which traits we
will use, whether we want cape to normalize the traits for us, and
whether we want cape to save the results. This is also the section in
which we tell cape whether we want to use a kinship correction or not.
kinship corrections are discussed further below.
If a parameter has multiple entries, for example we always want to
test multiple traits, each entry gets its own line starting with a
dash.
traits: This is where the traits to be scanned are
entered. Each trait is entered on its own line preceded by a dash.
covariates: Any covariates are entered here. These
are originally part of the trait matrix and are designated by cape as
covariates. Each covariate specified is included as an additive
covariate in each model, and also tested as an interactive covariate.
Here we specify “pgm” as our covariate.
scan_what: This parameter refers to whether we will
scan individual traits or composite traits called eigentraits.
Eigentraits are calculated by factoring the trait matrix by singular
value decomposition (SVD):
\[Y = U \cdot V \cdot W^{T}\]
Where \(Y\) is a matrix containing
one column for each mean-centered, normalized phenotype and one row for
each individual. If \(Y\) contains more
individuals than phenotypes, the \(U\)
matrix has the same dimensions as Y with each column containing one
eigentrait. \(V\) contains the singular
values, and \(W^{T}\) contains the
right singular vectors in rows. See Carter et al.
(2012) for more details.
The SVD de-correlates the traits concentrating phenotypic features
into individual eigentraits. One benefit of this process is that
variants that are weakly correlated to several traits due to common
underlying processes may be strongly correlated to one of the
eigentraits. This eigentrait captures the information of the underlying
process, making strong main effects distributed between traits easier to
detect and identify as potential interaction loci and/or covariates.
To specify using individual traits, set scan_what
to
raw_traits. To specify using eigentraits, set scan_what
to
eigentraits.
traits scaled: This is a logical value (true/false)
indicating whether cape should mean center and standardize the
traits.
traits normalized: This is a logical value
indicating whether cape should normalize the traits.
eig_which: After performing the SVD to create
orthogonal eigentraits, you may wish to analyze only a subset of them in
cape. For example, if the first two eigentraits capture more than 90% of
the trait variance, you may wish to discard the last eigentrait. This
results in a loss of some information, but may increase power to detect
important trait-related signals.
To specify which eigentraits to use, enter each on its own line. This
is a bit more tedious than simply entering a number of eigentraits, but
offers a little more flexibility in case you would like to analyze
eigentraits that don’t necessarily start at eigentrait 1.
pval_correction: This is where we specify the
correction for multiple testing. It can be any of the options in
p.adjust:
“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”,
“BY”, “fdr”, or “none.”
use_kinship: A logical value indicating whether to
use a kinship correction.
pop: This parameter is only necessary to set if
use_kinship is set to true. If use_kinship is true, pop can be one of
“2PP” for a two-parent cross, “MPP” for a multi-parent cross, or “RIL”
for a recombinant inbred line. This parameter is passed to
kinship
to inform the calculation of the kinship
matrix.
save_results: A logical value indicating whether
results should be written to files.
use_saved_results A logical value indicating whether
cape should use data already written to file. This can be helpful if a
job aborts partway through. You can use everything that has been
calculated already and save time on the next run. If use_saved_results
is true, only results that don’t already exist will be calculated. This
can cause problems if there has been a parameter change between two
runs, and there is a mismatch between previously calculated results and
new results. We recommend setting this to false for most cases.
transform_to_phenospace A logical value indicating
that is required scanning eigentraits. If true, all effects are rotate
back to trait space after calculation, such that the final networks show
marker influences on traits. If false, all effects will be kept in terms
of eigentraits, and all final tables and plots will show marker effects
on eigentraits.
Single scan parameters
Cape performs marker regression on individual markers before running
the pairwise tests. If there are more markers than can be tested
reasonably in the pairwise scan, cape uses the results from the
single-locus scan to select markers for the pairwise scan.
ref_allele: The reference allele. In two-parent
populations, this will usually be allele A (major allele). In
multi-parent populations, this may be a different allele. For example,
the C57B6/J mouse is often used as a reference strain in mouse studies.
It is also one of the founders of the Diversity Outbred (DO) and
Collaborative Cross (CC) mice (Chesler et al.
2008), and serves as a reasonable reference. To specify the B6
mouse as the reference, use its letter code “B” as the reference allele.
All alleles in cape are stored as individual letters, A through the
number of alleles that are present. For example, there are eight founder
alleles in the DO/CC mice that are represented by the letters A through
H.
singlescan_perm: This parameter specifies the number
of permutations done in the single-locus scan. It is perfectly okay to
set this to 0. The permutations are not used for anything in cape. They
simply allow you to make a first-pass glance at identifying markers with
main effects in your data. However, if you are interested in doing the
main effect scan properly, we recommend using R/qtl2 (Karl W. Broman et al. 2019).
alpha: If “singlescan_perm” is a number greater than
0, you can also specify a significance threshold alpha. If alpha is
specified, cape will calculate an effect size threshold for each value
of alpha. These can be plotted using plot_singlescan
.
Marker Selection Parameters
As mentioned above, the single-locus scan is primarily used to select
markers for the pairwise scan. We usually do this by selecting the
markers with the largest main effects. Alternatively, you can also
specify markers for the pairwise scan using a text file.
marker_selection_method: This parameter should be
either “top_effects”, or “from_list.” If “top_effects,” cape will select
markers for the pair scan based on their main effects across all traits
in the single-locus scan. In this case, a few additional parameters need
to be specified:
peak_density: To select markers for the pairscan,
cape first identifies peaks in effect size across all traits, and
samples markers within the peaks for further testing. “peak_density”
specifies the fraction of markers that will be sampled uniformly from
under a large peak. For example, if “peak_density” is set to 0.5, 50% of
markers under a peak will be selected for downstream testing. For
populations with high linkage disequilibrium (LD), you might want to set
this lower to get a better sampling of peaks from across the genome with
less redundancy. For a population with low LD, you may want to set this
parameter higher to get a better sampling of markers with large main
effects.
num_alleles_in_pairscan: This parameter sets how
many markers will be tested in the pairwise scan. Cape is relatively
computationally intensive, and we cannot do exhaustive pairwise testing
in densely genotyped populations. We suggest limiting this number to
around 1500 markers. These can be tested in roughly 24 hours using 2 or
3 traits/eigentraits and 1.5 million permuted tests. Increasing the
number of traits/eigentraits analyzed substantially increases the time
of a run, and it might be necessary to lower the number of markers
tested further if there are many traits/eigentraits in the analysis.
tolerance: In selecting markers by their main effect
size, cape starts at a high effect size, and gradually lowers it until
it has collected the desired number of markers. The “tolerance” gives
cape some wiggle room in how many markers are selected. If
“num_alleles_in_pairscan” is 100 and the “tolerance” is 5, cape will
allow any number of markers between 95 and 105.
snp_file: If “marker_selection_method” is
“from_list” a file name needs to be specified. The file must be in the
results directory. It is a tab-delimited text file with one column. The
column should contain the names of the markers to be tested in the
pairscan. This can get a little tricky for multi-parent crosses. Because
cape tests individual alleles and not markers in the pair scan, the
markers must also have the allele name appended with an underscore
(“_“). This is relatively simple for a two-parent cross: If the
reference allele is A, all markers will just have a”_B” appended to
their name. If, however, this is a multi-parent population, specifying
the correct allele is important. Any of the alleles other than the
reference can be specified for each marker. The underscore introduces
another little complication, which is that no underscores are allowed in
marker names except to separate the marker name from the allele. If
there are underscores in marker names, cape deletes these. This means
that any file specifying marker names for the pairscan should also
remove all underscores that are not separating the marker name from the
allele. This is a bit cumbersome, but can be completely avoided by
setting “marker_selection_method” to “top_effects.”
Pairscan Parameters
The pairwise scan has a few additional parameters to set. Because
testing genetic markers in LD can lead to false positives, we refrain
from testing marker pairs that are highly correlated. This (Pearson)
correlation is set by the user as “max_pair_cor”. We generally use a
value of 0.5 in our own work, but this can be raised or lowered
depending on your own assessment of the influence of LD in your
population.
max_pair_cor: A numeric value between 0 and 1
indicating the maximum Pearson correlation that two markers are allowed
in pairwise testing. If the correlation between a pair of markers
exceeds this threshold, the pair is not tested. If this value is set to
NULL, “min_per_genotype” must have a numeric value.
min_per_geno: This is an alternative way to limit
the effect of LD on pairwise testing. This number sets the minimum
number of individuals allowable per pairwise genotype combination. If
for a given marker pair, one of the genotype combinations is
underrepresented, the marker pair is not tested. Setting this parameter
is not recommended if you have continuous genotype probabilities, as the
number of genotype combinations will be too high. If this value is NULL,
max_pair_cor must have a numeric value.
pairscan_null_size: This parameter can be a little
confusing, as it is different from the number of permutations to run in
the the pairwise scan. Carter et al.
(2012) showed that multiple tests from a single permutation can
be combined to create the null distribution for cape statistics. This
saves enormously on time. So instead of specifying the number of
permutations, we specify the total size of the null distribution. If you
are testing 100 markers, you will test at most 100 choose 2, or 4950
marker pairs. Choose a “pairscan_null_size” that is at least as big as
the number of marker pairs you are testing. For smaller numbers of
markers, you may want to choose a null distribution size that is many
times larger than the number of pairs you are testing.
Running cape
Unlike previous versions of cape, which required the user to perform
many individual steps, this version of cape can be run with a single
command run_cape
. The line below runs the complete cape
pipeline for the demo NON/NZO mouse backcross. If you run this line
interactively, set verbose to TRUE to see the progress printed to your
screen. When verbose is set to FALSE, important messages are still
printed, but printing of progress is suppressed.
Note on results: To have the results files saved so that you
can visualize them without further commands, open up the parameter file
0_NON_NZO.parameters_0.yml, and set save_results
to
true
. This will tell cape to save results files to the
results directory. Otherwise, only the final_cross
object
is returned, and plots must be generated using further commands.
To use the saved results, set use_saved_results
to
true
. This will enable you to read in and use previously
saved results in case the cape run halts before finishing.
param_file <- here("demo", "demo_qtl", "0_NON_NZO.parameters_0.yml")
results_path <- here("demo", "demo_qtl")
final_cross <- run_cape(obesity_cross, obesity_geno, results_file = "NON_NZO",
p_or_q = 0.05, verbose = FALSE, param_file = param_file, results_path = results_path)
## Removing 18 individuals with missing phenotypes.
The benefit to this new function is obvious: cape is now much easier
to run. However, it comes with a sizeable reduction in flexibility. For
most users the benefit of the increased ease of use will far outweigh
the decrease in flexibility.
If you do need of more flexibility, all functions in the function
run_cape
are exported, so the entire analysis can be run
step-by-step as it was in earlier versions. All plotting functions are
also exported to allow greater flexibility in plotting. Although
run_cape
does some plotting automatically, as long as
save_results
is set to true in the parameter file, you may
wish to re-run some plotting functions with your own settings.
However you choose to run cape, the cape team is very happy to answer
any questions and to help with running or interpreting any cape
analysis.
Eigentraits
As described above, the first step performed by cape is typically to
decompose the trait matrix into eigentraits. This is done if “scan_what”
is set to “eigentraits.”
This step uses singular value decomposition (SVD) to decompose the
trait matrix into orthogonal eigentraits. Because we use modestly
correlated traits in cape by design, the eigentrait decomposition may
help concentrate correlated signals, that are otherwise distributed
across traits, into individual eigentraits. This potentially increases
power to detect variants associated with the common components of groups
of traits.
Cape uses the function plot_svd
to plot the results of
the SVD. The plot for the NON/NZO data set are shown below.
In the example illustrated here, the first eigentrait captures 75% of
the total trait variance. This eigentrait describes the processes by
which body weight, glucose levels, and insulin levels all vary together.
The correlations between obesity and risk factors for obesity, such as
elevated insulin and fasting glucose levels are well known (Permutt, Wasson, and Cox 2005; Das and Elbein 2006;
Haffner 2003). The second eigentrait captures 14% of the variance
in the phenotypes. It captures the processes through which glucose and
body weight vary in opposite directions. This eigentrait may be
important in distinguishing the genetic discordance between obesity and
diabetes. While obesity is a strong risk factor for diabetes, not all
those who are obese have diabetes, and not all those with diabetes are
obese (Permutt, Wasson, and Cox 2005; Burcelin et
al. 2002).
The third eigentrait is less interpretable biologically, as it
describes the divergence of blood glucose and insulin levels. It may
represent a genetic link between glucose and body weight that is
non-insulin dependent. Because we are primarily interested in the
connection between diabetes and insulin, we used only the first two
eigentraits for the analysis. In many cases in which more than two
phenotypes are being analyzed, the first two or three eigentraits will
capture the majority of the variance in the data and capture obvious
features. Other eigentraits may capture noise or systematic bias in the
data. Often the amount of total variance captured by such eigentraits is
small, and they can be removed from the analysis.
Ultimately, there is no universal recipe for selecting which
eigentraits should be included in the analysis, and the decision will be
based on how the eigentraits contribute to the original phenotypes and
how much variance in the data they capture.
Single-locus scan
After computing eigentraits, we go on to perform marker regression on
all individual markers:
\[U_{i}^{j} = \beta_{0}^{j} +
x_{i}\beta^{j} + \epsilon_{i}^{j}\]
The index \(i\) runs from 1 to the
number of individuals, and \(j\) runs
from 1 to the number of eigentraits or traits. \(x_{i}\) is the probability of the presence
of the reference allele for individual \(i\) at locus \(j\). The primary purpose of this scan is is
to select markers for the pairwise scan if there are too many to test
exhaustively.
Although run_cape
creates files with the singlescan
results automatically, it is also sometimes desireable to re-plot these
results with different parameters. We show how to do that below by
reading in the singlescan results file and using
plot_singlescan
with parameters different from those in
run_cape.
singlescan_obj <- readRDS(here("demo", "demo_qtl", "results", "NON_NZO_singlescan.RDS"))
plot_singlescan(final_cross, singlescan_obj, line_type = "h", lwd = 2,
covar_label_size = 1)
Pairwise Testing
The purpose of the pairwise scan is to find interactions, or
epistasis, between variants. The epistatic models are then combined
across traits or eigentraits to infer a parsimonious model that takes
data from all traits or eigentraits into account.
To find epistatic interactions we test the following regression model
for each variant 1 and 2:
\[
U_{i}^{j} = \beta_{0}^{j} +
\underbrace{\sum_{c=1}^{2}x_{c,i}\beta_{c}^{j}}_{\mathrm{covariates}} +
\underbrace{x_{1,i}\beta_{1}^{j} +
x_{2,i}\beta_{2}^{j}}_{\mathrm{main\;effects}} +
\underbrace{x_{1,i}x_{2,i}
\beta_{12}^{j}}_{\mathrm{interaction}} + \epsilon_{i}^{j}
\]
The terms in this equation are the same as those in the equation for
the single-marker scan except for the addition of the term for the
interaction between the two variants being tested.
kinship Correction
To reduce the incidence of false positives arising from genetic
relatedness, cape can implement a kinship correction (Kang et al. 2008).
The relatedness matrix (\(K\)) is
calculated using the markers on the remaining chromosomes as
follows:
\[ K = \frac{G \times
G^T}{n},\]
where \(G\) and \(n\) is the number of markers in \(G\). This matrix is then used to correct
the genotype and phenotype matrices for kinship using hierarchical
linear models (Kang et al. 2008; Gelman and Hill
2006).
Currently cape only implements an overall kinship correction.
Anecdotally, we have tested leave-two-chromosomes-out methods as an
extension of the leave-one-chromosome-out method already commonly used
(Cheng et al. 2013; Gonzales et al. 2018),
but our results were inconsistent suggesting we need to do more research
into this area. More appropriate corrections are currently a topic of
research in our group and will be implemented in later iterations of
cape.
Reparameterization
The reparameterization step is where we actually do the combined
analysis of pleiotropy and epistasis that cape is named for. This step
reparameterizes the \(\beta\)
coefficients from the pairwise linear regressions across all
traits/eigentraits in terms of directed influence coefficients. These
directed influence coefficients are consistent across all traits, and
are therefore trait-independent. Cape identifies one set of genetic
interactions that explains all traits simultaneously (Carter et al. 2012).
From the pair scan, each pair of markers 1 and 2 receives a set of
\(\beta\) coefficients describing the
main effect of each marker on each eigentrait \(j\) (\(\beta^j_1\) and \(\beta^j_2\)) as well as the interaction
effect of both markers on each eigentrait (\(\beta^j_{12}\)) (See figure below). The
central idea of cape
is that these coefficients can be
combined across eigentraits and reparameterized to calculate how each
pair of markers influences each other directly and independently of
eigentrait.
The first step in this reparameterization is to define two new
parameters (\(\delta_1\) and \(\delta_2\)) in terms of the interaction
coefficients. \(\delta_1\) can be
thought of as the additional genetic activity of marker 1 when marker 2
is present. Together the \(\delta\)
terms capture the interaction term, and are interpreted as the extent to
which each marker influences the effect of the other on downstream
phenotypes. For example, a negative \(\delta_2\) indicates that the presence of
marker 2 represses the effect of marker 1 on the phenotypes or
eigentraits. The \(\delta\) terms are
related to the main effects and interaction effects as follows:
\[
\label{eqn:delta_mat_def}
\begin{bmatrix}
\beta^1_1 & \beta^1_2\\
\beta^2_1 & \beta^2_2\\
\vdots & \vdots
\end{bmatrix}
\cdot
\begin{bmatrix}
\delta_1\\
\delta_2\\
\end{bmatrix}
=
\begin{bmatrix}
\beta^1_{12}\\
\beta^2_{12}\\
\vdots
\end{bmatrix}
\]
In multiplying out this equation, it can be seen how the \(\delta\) terms influence each main effect
term to give rise to the interaction terms independent of phenotype.
\[
\label{eqn:delta_def}
\beta^j_1\delta_1 + \beta^j_2\delta_2 = \beta^j_{12}
\]
If \(\delta_1 = 0\) and \(\delta_2 = 0\), there are no addition
effects exerted by the markers when both are present. Substitution into
the equations above shows that the interaction terms \(\beta^j_{12}\) are \(0\) and thus the interaction terms have no
effect on the phenotype.
Alternatively, consider the situation when \(\delta_1 = 1\) and \(\delta_2 = 0\). The positive \(\delta_1\) indicates that marker 1 should
exert an additional effect when marker 2 is present. This can be seen
again through substitution into equation \(\ref{eqn:delta_def}\):
\[ \beta_j^1 = \beta_{12}^j \]
These non-zero terms show that there is an interaction effect between
marker 1 and marker 2. The positive \(\delta_1\) indicates that this interaction
is driven through an enhanced effect of marker 1 in the presence of
marker 2.
The \(\delta\)s are calculated by
solving for equation \(\ref{eqn:delta_mat_def}\) using matrix
inversion:
\[
\begin{bmatrix}
\delta_1\\
\delta_2\\
\end{bmatrix}
=
\begin{bmatrix}
\beta^1_1 & \beta^1_2\\
\beta^2_1 & \beta^2_2\\
\vdots & \vdots
\end{bmatrix}^{-1}
\cdot
\begin{bmatrix}
\beta^1_{12}\\
\beta^2_{12}\\
\vdots
\end{bmatrix}
\]
This inversion is exact for two eigentraits, and cape
implements pseudo-inversion for up to 12 eigentraits. We have observed
that maximum power to detect interactions is achieved with three to five
eigentraits (Tyler et al. 2017).
The \(\delta\) terms are then
translated into directed variables defining the marker-to-marker
influences \(m_{12}\) and \(m_{21}\). Whereas \(\delta_2\) described the change in activity
of marker 2 in the presence of marker 1, \(m_{12}\) can be thought of as the direct
influence of marker 2 on marker 1, with negative values indicating
repression and positive values indicating enhancement. The terms \(m_{12}\) and \(m_{21}\) are self-consistent and defined in
terms of \(\delta_1\) and \(\delta_2\):
\[
\delta_1 = m_{12}(1 + \delta_2),\;\delta_2 = m_{21}(1 + \delta_1)
\]
Rearranging these equations yields the solutions:
\[
m_{12} = \frac{\delta_1}{1 + \delta_2},\;m_{21} = \frac{\delta_2}{1 +
\delta_1}.
\]
These directed influence variables provide a map of how each marker
influences each other marker independent of phenotype. The significance
of these influences is determined through standard error analysis on the
regression parameters (Bevington 1994; Carter et
al. 2012). This step is particularly important as matrix
inversion can lead to large values but larger standard errors, yielding
insignificant results. As an example, the variance of \(m_{12}\) is calculated by differentiating
with respect to all model parameters:
\[
\sigma^{2}_{m_{12}} \cong \sum_{ij} \sigma^2_{\beta_{i}^{j}}
\Bigg(\frac{\partial m_{12}}
{\partial \beta_{i}^{j}}\Bigg)^2 + 2 \sum_{i<k, j < l}
\sigma^2_{\beta^{j}_{i}\beta^{l}_{k}} \Bigg(\frac{\partial m_{12}}
{\partial \beta^{j}_{i}} \Bigg) \Bigg(\frac{\partial m_{12}}{\partial
\beta^{l}_{k}} \Bigg)
\]
In this equation, the indices \(i\)
and \(k\) run over regression
parameters and \(j\) and \(l\) run from 1 to the number of traits.
Main Effects
In addition to identifying directed influences between genetic
markers, cape also calculates main effects. These effects are a little
different from the main effects we typically calculate. The main effect
of a locus calculated in the usual way is the main effect averaged
across all genetic backgrounds. In cape, the main effect of a marker is
derived from the set of all pairwise regressions that included that
marker. We select the maximum main effect exerted by that marker across
all pairwise tests. That is, the main effect that is reported by cape is
a main effect conditioned on another marker or covariate. The
conditioning markers and covariates are reported along with the main
effects in the VariantInfluences.csv file.
If eigentraits were scanned, the main effects can be recast in terms
of the original traits. This step is performed by multiplying the
coefficient matrices are by the singular value matrices \(V \cdot W^{T}\). With two phenotypes and
two eigentraits this conversion results in no loss of information. The
translation back to phenotype space does not affect the interactions
between markers.
CAPE Results
The primary output of cape is the file VariantInfluences.csv. This
file holds information about the interactions and main effects
identified in the data. Another file
Variant.Influences.Interactions.csv, excludes the main effects. Please
see the documentation for the function
write_variant_influences
for a description of the columns
in these files.
Cape results can also be plotted in multiple different forms. The
function plot_variant_influences
plots the results as a
heatmap. Interactions between genetic loci are shown in the main part of
the graph and main effects are shown along the right-hand side.
The direction of the influences is read from the \(y\)-axis to the \(x\)-axis. For example, there is a large
blue block near the top of the figure indicating that multiple markers
on chromosome 1 suppress the phenotypic effects of markers on chromosome
12. To see the main effects of markers on chromosome 1, follow the
chromosome 1 line all the way to the right to see that markers on
chromosome 1 increase levels of all three traits.
plot_variant_influences(final_cross, show_alleles = FALSE)
Positive effects are shown in brown, and negative effects are in
blue. Marker pairs that were not tested due to high correlation are
shown in gray. Chromosome boundaries and labels are shown along the
\(x\) and \(y\) axes.
The function plot_network
plots cape results as a
circular network. Here chromosomes are arranged in a circle with traits
in concentric circles around the chromosomes. Main effects are depicted
in these concentric circles and genetic interactions are shown as arrows
within the circle.
plot_network(final_cross)
And finally, the function plot_full_network
This
function plots the same results in a network that ignores genomic
position. This view accentuates the structure of the genetic interaction
network.
Each node is one genetic locus or covariate. Each is plotted as a pie
chart with one trait per section. Significant effects are indicated by
coloring the sections corresponding to the traits either brown for
positive main effects or blue for negative main effects. Gray indicates
no significant main effect. Interactions are shown as arrows between the
nodes.
Interpreting CAPE Results
Both the network figure and the adjacency matrix show direct
influences of markers on the phenotypes as well as interactions between
markers. As expected, many NZO variants on multiple chromosomes show
positive effects on plasma insulin and glucose levels as well as on body
weight.
We can plot individual interactions from the Variant Influences table
to get a more detailed look at some of these interactions. There is a
new plotting function in cape called plot_effects
that can
be used to plot the phenotypic effects of individual interactions in
multiple different styles.
For example, there is a positive interaction between Chrs 2 and 15 in
which the presence of the NZO allele on Chr 2 enhances the positive
influence of the NZO allele on all traits.
We can look at both the main effects and the interaction effects
using plot_effects
. First to verify the positive effect of
the marker on Chr 15, we can plot it by itself. Here we use
plot_type = l
which generates a line plot.
plot_effects(data_obj = final_cross, geno_obj = obesity_geno,
marker1 = "D15Mit72_B", marker1_label = "Chr15", plot_type = "l",
error_bars = "se")
We can then look at both markers together to see the enhanced effect
of this allele when the Chr 2 NZO allele is present.
plot_effects(data_obj = final_cross, geno_obj = obesity_geno,
marker1 = "D2Mit120_B", marker2 = "D15Mit72_B", marker1_label = "Chr2",
marker2_label = "Chr15", plot_type = "l", error_bars = "se")
We can see from this plot that the effect of the Chr 15 NZO allele is
negligible when the Chr 2 NZO allele is present, but is quite
substantial when the Chr 2 NZO allele is present. Thus the NZO allele on
Chr 2 enhances the positive effect of the Chr 15 allele across all
traits.
We can look at this with other types of visualization. The bar plot
below shows the same data in a different form. We can see here that the
Chr 2 allele has a negative effect on the trait, while the Chr 15 allele
has a small positive effect. However, when both NZO alleles are present,
all trait values are higher than expected from the additive effects
alone.
plot_effects(data_obj = final_cross, geno_obj = obesity_geno,
marker1 = "D2Mit120_B", marker2 = "D15Mit72_B", marker1_label = "Chr2",
marker2_label = "Chr15", plot_type = "b", error_bars = "se")
There are several other plotting types that are worth exploring when
looking at different types of interactions, including heat maps, and
points for individual values.
These findings illustrate how cape
is designed to find
interactions that simultaneously model all phenotypes under the
assumption that interactions between variants across multiple contexts
represent a single underlying interaction network. Thus we recommend
users assess single-phenotype epistasis using functions in
cape
or in parallel analyses using tools such as R/qtl2 and
R/qtlbim (Yandell et al. 2007).
List of Plotting Functions
The following list describes the plotting functions available in
cape. They are ordered by where you would probably use them in a cape
analysis.
hist_pheno: Plots trait distributions as histograms.
Helps identify whether traits are normally distributed.
qnorm_pheno: Plots trait quantiles compared to
theoretical normal quantiles. Helps identify whether traits are normally
distributed.
plot_pheno_cor: Plots the correlation between pairs
of traits. Includes individual trait distributions, correlation plots,
and correlation values. Points can be color-coded by a covariate or
another trait.
plot_svd: Plots the results of the singular value
decomposition that generates the eigentraits. Shows the total trait
variance explained by each eigentrait, as well as the contribution of
each trait to each eigentrait.
plot_singlescan: Plots the results of the
single-locus scan. Can plot either standardized effects or
non-standardized effects. Also plots effects both by allele and by
marker.
plot_pairscan: Plots the results of the pairwise
scan.
plot_variant_influences: Plots cape results as a
heatmap. Genetic interactions are plotted in the main part of the
heatmap, and main effects are plotted along the right-hand side.
Positive and negative effects are distinguished by color. This plot is
useful for identifying overall trends in interaction networks,
particularly dense interaction networks.
plot_network: Plots cape results as a circular
network. Genetic markers are laid out in a circle with traits in
concentric circles around the chromosomes. Main effects are indicated in
the trait circles. Interactions are drawn as arrows between genetic
markers or between genetic markers and covariates. This view is good for
identifying patterns on a finer scale than the heat map view,
particularly for sparse networks. This view is not very informative for
dense networks.
plot_full_network: Plots cape results in a standard
network layout. Each marker is a node, and its main effects are shown in
a pie chart in which each section of the pie is a trait. Interactions
are shown as arrows between nodes. This view is good for looking at
overall network structure independent of genomic position. This view can
be informative both for dense and sparse networks. The function contains
many arguments for adjusting the layout of the network for better
visualization.
plot_effects: Plots the effects of individual
interactions on traits. This function is for exploring individual
interactions in more detail. It can plot main effects and interaction
effects as lines, points, bars, or heatmaps. To select individual
interactions to plot, use the file plot.variant.influences.csv or
Variant.Influences.Interactions.csv. The function
plot_variant_influences can also return these tables invisibly without
writing to a file.
List of Output Files
The following list describes each output file produced by
run_cape
. Where we use cross
below, this
stands in for the user-defined experiment name. The default experiment
name in cape is “cross.”
cross_geno.RData: The genotype object used in the
cape run. This will have any markers on the X, Y, or mitochondrial
chromosomes removed, and any underscores removed from marker names. This
genotype object can be used in post-cape plotting functions.
cross.pairscan.RData: The results of the pairwise
scan. It consists of a list with the following five elements: *
ref_allele: The allele used as the reference for the tests. *
max_pair_cor: The maximum pairwise correlation between marker pairs *
pairscan_results: A list with one element per trait. The element for
each trait is a list of the following three elements: *
pairscan_effects: the effect sizes from the linear models * pairscan_se:
the standard errors from the linear models * model_covariance: the model
covariance from the linear models. * pairscan_perm: The same structure
as pairscan_results, but for the permuted data. * pairs_tested_perm: A
matrix of the marker pairs used in the permutation tests.
cross.parameters.yml The parameter file dictating
all the parameters for the cape run.
cross.RData The data object. This object holds all
the trait data, and cape results. It is used as input for almost all
cape functions.
cross.singlescan.RData: The results of the
single-locus scan. The results are a list of the following seven
elements: * alpha: The alpha values set in the argument alpha *
alpha_thresh: The calculated effect size thresholds at each alpha if
permutations are run. * ref_allele: The allele used as the reference
allele * singlescan_effects: The effect sizes (beta coefficients) from
the single-locus linear models * singlescan_t_stats: The t statistics
from the single-locus linear models * locus.p_vals: Marker-level p
values * locus_score_scores: Marker-level test statistics.
Network.Circular.jpg A jpg file containing the
circular cape results plot generated by plot_network
Network.Circular.pdf: A pdf file containing the
circular cape results plot generated by plot_network
Network.View.jpg: A jpg file containing the cape
results plot generated by plot_full_network
.
Network.View.pdf: A pdf file containing the cape
results plot generated by plot_full_network
.
Pairscan.Regression.jpg: A jpg file containing the
results of the pairwise scan plotted by plot_pairscan
.
Pairscan.Regression.pdf: A pdf file containing the
results of the pairwise scan plotted by plot_pairscan
.
permutation.data.RData The results of the
permutations from the function pairscan
.
Singlescan.Effects.jpg The results of the
single-locus scan. All allele effects are plotted in this plot, as well
as the effects for each locus overall. Multiple allele effects are only
plotted for multi-parent populations. Otherwise, the allele plot looks
similar to the locus plot. There will be one singlescan effects plot for
each trait, and the trait name will be included in the filename. This
plot is generated by plot_singlescan
.
Singlescan.ET1.Standardized.jpg: The results of the
single-locus scan. In contrast to the file above, this plot shows only
the standardized allele effects for each locus. Similar to the file
above, there will be one effects plot for each trait, and the trait name
will be included in the filename. This plot is generated by
plot_singlescan
.
svd.jpg: A jpg file showing the results of the
singular value decomposition of the traits, if that step was performed.
The plot is generated by plot_svd
. It shows the total trait
variance explained by each eigentrait, as well as the contribution of
each trait to each eigentrait.
svd.pdf A pdf version of svd.jpg.
Variant.Influences.csv: A csv file holding the
results of cape, and written by the function `write_variant_influences’
This file contains both main effects and interactions.
Variant.Influences.Interactions.csv: A csv file
holding the results of cape, and written by the function
`write_variant_influences’ This file contains only the interactions.
variant.influences.jpg: A jpg showing cape results
as plotted by plot_variant_influences
.
variant.influences.pdf: A pdf showing cape results
as plotted by plot_variant_influences
.
References
Bevington, P. R. 1994. Data Reduction and Error Analysis for the
Physical Sciences, Ise. New York: McGraw-Hill.
Broman, Karl W, Daniel M Gatti, Petr Simecek, Nicholas A Furlotte, Pjotr
Prins, Śaunak Sen, Brian S Yandell, and Gary A Churchill. 2019.
“R/Qtl2: Software for Mapping Quantitative Trait Loci with
High-Dimensional Data and Multiparent Populations.”
Genetics 211 (2): 495–502.
Broman, Karl W., Hao Wu, Saunak Sen, and Gary A. Churchill. 2003.
“R/Qtl: QTL Mapping in Experimental Crosses.”
Bioinformatics 19: 889–90.
Burcelin, Rémy, Valérie Crivelli, Anabela Dacosta, Alexandra
Roy-Tirelli, and Bernard Thorens. 2002. “Heterogeneous metabolic adaptation of C57BL/6J mice to
high-fat diet.” American Journal of Physiology.
Endocrinology and Metabolism 282 (4): E834–42.
Carter, Gregory W, Michelle Hays, Amir Sherman, and Timothy Galitski.
2012. “Use of pleiotropy to model genetic
interactions in a population.” PLoS Genetics 8
(10): e1003010.
Cheng, Riyan, Clarissa C Parker, Mark Abney, and Abraham A Palmer. 2013.
“Practical Considerations Regarding the Use of Genotype and
Pedigree Data to Model Relatedness in the Context of Genome-Wide
Association Studies.” G3: Genes, Genomes, Genetics 3
(10): 1861–67.
Chesler, Elissa J, Darla R Miller, Lisa R Branstetter, Leslie D
Galloway, Barbara L Jackson, Vivek M Philip, Brynn H Voy, et al. 2008.
“The Collaborative Cross at Oak Ridge National Laboratory:
Developing a Powerful Resource for Systems Genetics.”
Mammalian Genome 19 (6): 382–89.
Das, Swapan Kumar, and Steven C Elbein. 2006. “The Genetic Basis of Type 2 Diabetes.”
Cellscience 2 (4): 100–131.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using
Regression and Multilevel/Hierarchical Models. Cambridge University
Press.
Gonzales, Natalia M, Jungkyun Seo, Ana I Hernandez Cordero, Celine L St
Pierre, Jennifer S Gregory, Margaret G Distler, Mark Abney, Stefan
Canzar, Arimantas Lionikas, and Abraham A Palmer. 2018. “Genome
Wide Association Analysis in a Mouse Advanced Intercross Line.”
Nature Communications 9 (1): 1–12.
Haffner, S. 2003. “Epidemic Obesity and the
Metabolic Syndrome.” Circulation 108 (13):
1541–45.
Kang, Hyun Min, Noah A Zaitlen, Claire M Wade, Andrew Kirby, David
Heckerman, Mark J Daly, and Eleazar Eskin. 2008. “Efficient control of population structure in model
organism association mapping.” Genetics 178 (3):
1709–23.
Permutt, M Alan, Jonathon Wasson, and Nancy Cox. 2005. “Genetic epidemiology of diabetes.” The
Journal of Clinical Investigation 115 (6): 1431–39.
Purcell, Shaun, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel AR
Ferreira, David Bender, Julian Maller, et al. 2007. “PLINK: A Tool
Set for Whole-Genome Association and Population-Based Linkage
Analyses.” The American Journal of Human Genetics 81
(3): 559–75.
Reifsnyder, P C. 2000. “Maternal Environment
and Genotype Interact to Establish Diabesity in Mice.”
Genome Research 10 (10): 1568–78.
Tyler, Anna L, Bo Ji, Daniel M Gatti, Steven C Munger, Gary A Churchill,
Karen L Svenson, and Gregory W Carter. 2017. “Epistatic Networks
Jointly Influence Phenotypes Related to Metabolic Disease and Gene
Expression in Diversity Outbred Mice.” Genetics 206 (2):
621–39.
Yandell, Brian S, Tapan Mehta, Samprit Banerjee, Daniel Shriner,
Ramprasad Venkataraman, Jee Young Moon, W Whipple Neely, Hao Wu, Randy
von Smith, and Nengjun Yi. 2007. “R/qtlbim:
QTL with Bayesian Interval Mapping in experimental
crosses.” Bioinformatics 23 (5): 641–43.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.