The sim_reg function uses MCMC to sample parameters from the ‘Phenotype similarity regression’ model conditional on supplied binary genotypes and ontological term sets.
The vignette ‘Similarity Regression - Introduction’ shows a simple application based on simulated data. This vignette demonstrates how to include prior phenotypic information in the inference. The information is supplied to the inference procedure through a parameter called lit_sims, which should be a numeric vector of relative weights for terms in the sample space of phi (by default, the set of all terms present amongst the terms in x, together with their ancestors). Terms not specified in this vector default to a weight of 1.
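As a rough illustration of what ‘relative weights’ with a default of 1 means, the base R sketch below fills in unspecified terms with a weight of 1 and then normalises the result. The term IDs are taken from the output later in this vignette, but the three-term sample space, the variable names and the normalisation step are assumptions made for illustration; this is not a description of how sim_reg handles lit_sims internally.

```r
# Hypothetical three-term sample space for phi (illustrative only)
sample_space <- c("HP:0000364", "HP:0000598", "HP:0002817")

# A partial lit_sims vector: only one term is given an explicit weight
lit_sims_partial <- c("HP:0000364" = 10)

# Start every term at the default weight of 1 ...
weights <- setNames(rep(1, length(sample_space)), sample_space)

# ... then overwrite the terms that were explicitly specified
specified <- intersect(names(lit_sims_partial), sample_space)
weights[specified] <- lit_sims_partial[specified]

# Assumption: the weights are relative, so normalise them to sum to 1 for display
relative_weights <- weights / sum(weights)
print(relative_weights)
```

Under these assumptions the specified term is ten times as heavily weighted as each of the two unspecified terms.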
library(ontologyIndex)
library(ontologySimilarity)
library(SimReg)
data(hpo)
set.seed(1)
terms <- get_ancestors(hpo, c(
  hpo$id[match(c("Abnormality of thrombocytes", "Hearing abnormality"), hpo$name)],
  sample(hpo$id, size=50)))
To help illustrate the ideas, we’ll consider a scenario where there is some evidence for an association between a phenotype - which in this example we’ll set as ‘Hearing abnormality’ - and a binary genotype, and where there is also a prior expectation that the ‘characteristic phenotype’ of the genotype would involve hearing abnormality. For instance, the genotype might depend on variants in a gene orthologous to one known to harbour ‘Hearing abnormality’ variants in a model organism. We apply the inference procedure to the data with and without the prior and observe the effect on the inferred ‘probability of association’, gamma.
hearing_abnormality <- hpo$id[match("Hearing abnormality", hpo$name)]
genotypes <- c(rep(TRUE, 3), rep(FALSE, 97))
# Give all subjects 5 random terms and add 'Hearing abnormality' for those with y_i = TRUE
phenotypes <- lapply(genotypes, function(y_i) minimal_set(hpo, c(
  if (y_i) hearing_abnormality else character(0),
  sample(terms, size=5))))
So there are three cases with the rare variant (i.e. having y_i = TRUE) and all of them have the ‘Hearing abnormality’ HPO term.
An application of sim_reg yields:
samples <- sim_reg(ontology=hpo, x=phenotypes, y=genotypes)
print(summary(samples), ontology=hpo)
## ---------------------------------------------------------------------------
## P(gamma=1|y) = 0.1038125
## ---------------------------------------------------------------------------
## Numeric parameters:
##   Parameter               Mean    SD
##   alpha                  -5.61  2.05
##   log_beta                2.83  0.97
##   logit_mean_f            0.38  0.89
##   log_alpha_plus_beta_f   2.10  0.97
##   logit_mean_g           -1.31  1.42
##   log_alpha_plus_beta_g   1.91  1.05
## ---------------------------------------------------------------------------
## Phi:
##   t           Name                                                     P
##   HP:0000598  Abnormality of the ear                                0.43
##   HP:0000364  Hearing abnormality                                   0.39
##   HP:0002817  Abnormality of the upper limb                         0.10
##   HP:0040068  Abnormality of limb bone                              0.08
##   HP:0003308  Cervical subluxation                                  0.08
##   HP:0012387  Bronchitis                                            0.07
##   HP:0000925  Abnormality of the vertebral column                   0.07
##   HP:0000164  Abnormality of the teeth                              0.07
##   HP:0011030  Abnormality of transition element cation homeostasis  0.05
##   HP:0011842  Abnormality of skeletal morphology                    0.04
## ---------------------------------------------------------------------------
We now consider constructing the lit_sims parameter to capture our knowledge about the gene from the model organism. One option is to create the lit_sims vector of prior weights explicitly, for example by assigning higher weights to terms that involve some kind of hearing problem. Here, we set the prior weight of every term whose name contains the word ‘hearing’ to ten times that of terms whose names do not.
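# Weight 10 for terms whose name contains "hearing" (case-insensitive), 1 otherwise
lit_sims <- ifelse(grepl(pattern="hearing", x=hpo$name, ignore.case=TRUE), 10, 1)
# Name the weights by term ID so that they can be matched to terms
names(lit_sims) <- hpo$id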
Note: one must set the names of the lit_sims vector, as sim_reg uses them to match the weights to terms.
If the prior knowledge, such as the phenotype of the model organism, has been ontologically encoded (for example, it may be available as MPO terms from the Mouse Genome Informatics (MGI) website, http://www.informatics.jax.org/, [1]), another option is to use a phenotypic similarity function to obtain the numeric vector of weights for inclusion of terms in phi [2]. This may be more convenient, particularly when dealing with large numbers of genes. In the SimReg paper [2], the vector is set by exponentiating the Resnik-based [3] similarities of terms to the terms in the ‘literature phenotype’.
To calculate similarities based on Resnik’s measure, we must first compute an ‘information content’ for each term, equal to the negative log of its frequency. The frequencies can be calculated with respect to different collections of phenotypes. Here, we calculate them with respect to our own collection, phenotypes, by calling get_term_info_content, a function exported by the ontologyIndex package. Alternatively, they could be calculated with respect to the frequency of each term amongst the ontological annotations of OMIM diseases (available from the HPO website, http://human-phenotype-ontology.github.io [4]). The function get_term_set_to_term_sims in the ontologySimilarity package can then be used to calculate the similarities between the terms in the sample space of phi and the literature_phenotype. It returns a matrix of similarities between the individual terms in the literature phenotype and the terms in the sample space. Suppose the phenotype of the model organism in our example comprises abnormality of the thrombocytes and hearing abnormality.
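thrombocytes <- hpo$id[match("Abnormality of thrombocytes", hpo$name)]
literature_phenotype <- c(hearing_abnormality, thrombocytes)
# Information content of each term, based on term frequencies in our phenotypes
info <- get_term_info_content(hpo, phenotypes)
# For each term, the mean exponentiated Resnik similarity to the literature terms
lit_sims_resnik <- apply(exp(get_term_set_to_term_sims(
  get_term_sim_mat(hpo, info, method="resnik"),
  literature_phenotype)), 2, mean)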
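To make the quantities in this calculation concrete, the following base R sketch works through information content and Resnik similarity on a toy ontology. The four terms, the four toy cases and the helper resnik are all hypothetical and exist only for this illustration; the real computation is performed by get_term_info_content and get_term_sim_mat, whose internals may differ in detail.

```r
# Toy ontology (hypothetical): each term is listed with itself and all its ancestors
ancestors <- list(
  root    = "root",
  ear     = c("ear", "root"),
  hearing = c("hearing", "ear", "root"),
  blood   = c("blood", "root")
)

# Toy phenotypes of four cases
cases <- list("hearing", "ear", "blood", c("hearing", "blood"))

# A case counts as having a term if the term is annotated or is an ancestor
# of an annotated term
present <- lapply(cases, function(trms) unique(unlist(ancestors[trms])))

# Frequency of each term across cases, and information content = -log(frequency)
freq <- sapply(names(ancestors), function(term)
  mean(sapply(present, function(p) term %in% p)))
ic <- -log(freq)

# Resnik similarity: information content of the most informative common ancestor
resnik <- function(a, b) max(ic[intersect(ancestors[[a]], ancestors[[b]])])

# Weight of each term: mean exponentiated similarity to the literature terms
literature_toy <- c("hearing", "blood")
toy_weights <- sapply(names(ancestors), function(term)
  mean(sapply(literature_toy, function(l) exp(resnik(l, term)))))
print(round(toy_weights, 2))
```

In this toy example the two terms in the literature phenotype receive the largest weights, root (which carries no information) receives a weight of 1, and ear falls in between.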
The resulting vector, lit_sims_resnik, can then be passed to sim_reg through the lit_sims parameter.
with_prior_samples <- sim_reg(
  ontology=hpo,
  x=phenotypes,
  y=genotypes,
  lit_sims=lit_sims_resnik
)
print(summary(with_prior_samples), ontology=hpo)
## ---------------------------------------------------------------------------
## P(gamma=1|y) = 0.2786875
## ---------------------------------------------------------------------------
## Numeric parameters:
##   Parameter               Mean    SD
##   alpha                  -5.79  2.00
##   log_beta                2.92  0.90
##   logit_mean_f            0.23  0.83
##   log_alpha_plus_beta_f   2.03  0.97
##   logit_mean_g           -1.56  1.25
##   log_alpha_plus_beta_g   1.87  1.04
## ---------------------------------------------------------------------------
## Phi:
##   t           Name                                       P
##   HP:0000364  Hearing abnormality                     0.48
##   HP:0000598  Abnormality of the ear                  0.38
##   HP:0002817  Abnormality of the upper limb           0.15
##   HP:0003308  Cervical subluxation                    0.06
##   HP:0006527  Lymphoid interstitial pneumonia         0.05
##   HP:0010831  Impaired proprioception                 0.05
##   HP:0011294  EEG with frontal sharp waves            0.04
##   HP:0011314  Abnormality of long bone morphology     0.04
##   HP:0002813  Abnormality of limb bone morphology     0.04
##   HP:0012252  Abnormal respiratory system morphology  0.04
## ---------------------------------------------------------------------------
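Note that including the lit_sims parameter has increased the posterior probability of association, P(gamma=1|y), from about 0.10 to about 0.28, and that ‘Hearing abnormality’ now tops the Phi table (0.48, up from 0.39).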
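One way to quantify the change is on the odds scale. The short calculation below uses the two posterior probabilities printed in the summaries above; the variable names are ours, and the values are copied by hand rather than extracted from the sim_reg output objects. Since the estimates come from MCMC, they will vary somewhat between runs.

```r
# Posterior probabilities of association copied from the two printed summaries
p_without_prior <- 0.1038125
p_with_prior    <- 0.2786875

# Convert a probability to odds
odds <- function(p) p / (1 - p)

# Ratio of the posterior odds with the literature prior to those without it
odds_ratio <- odds(p_with_prior) / odds(p_without_prior)
print(odds_ratio)
```

On this scale, the posterior odds of association are a little over three times higher when the literature-based prior is included.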
Often the binary genotype relates to a particular gene, and for many genes ontologically encoded phenotypes are available, either in the form of HPO-encoded OMIM annotations [4] or MPO annotations [1]. For a given set of subjects with HPO-coded phenotypes, it may be useful to apply the inference gene-by-gene, taking the binary genotype y to indicate the presence of a rare variant in each particular gene for each case. Thus, we may wish to systematically create informative prior distributions for phi for all genes. This can be done by downloading the file called ‘ALL_SOURCES_TYPICAL_FEATURES_genes_to_phenotype.txt’ from the HPO website and running the following code, which yields a list of term sets, one per gene (i.e. character vectors of HPO term IDs).
annotation_df <- read.table(header=FALSE, skip=1, sep="\t", quote="",
  file="ALL_SOURCES_TYPICAL_FEATURES_genes_to_phenotype.txt", stringsAsFactors=FALSE)
# Group HPO term IDs (column 4) by gene (column 2), keeping only terms known to hpo
hpo_by_gene <- lapply(split(f=annotation_df[,2], x=annotation_df[,4]),
  function(trms) minimal_set(hpo, intersect(trms, hpo$id)))
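The grouping step can be tried on a small stand-in for the annotation file. The data frame below is hypothetical (the gene names GENE_A and GENE_B are invented, and the four-column layout is assumed from the column indices used in the code above), and minimal_set is left out so that the sketch runs in base R alone.

```r
# Hypothetical stand-in for the annotation file: column 2 holds the gene,
# column 4 the HPO term ID (layout assumed from the indices used above)
annotation_toy <- data.frame(
  V1 = c(1, 1, 2),
  V2 = c("GENE_A", "GENE_A", "GENE_B"),
  V3 = c("Hearing abnormality", "Abnormality of the ear",
         "Abnormality of the upper limb"),
  V4 = c("HP:0000364", "HP:0000598", "HP:0002817"),
  stringsAsFactors = FALSE
)

# Group the term IDs by gene, giving a named list of character vectors
terms_by_gene <- split(x = annotation_toy[, 4], f = annotation_toy[, 2])
print(terms_by_gene)
```

Each element of such a list could then play the role of literature_phenotype when building a gene-specific lit_sims vector in the way shown earlier.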