SimReg
has a function sim_reg
for performing Bayesian Similarity Regression [1]. The aim is to estimate the probability of an association between ontological term sets and a binary variable, and should such an association exist, a characteristic ontological profile such that ontological similarity to the profile increases the probability of the binary variable taking the value TRUE
. The procedure has been used in the context of linking an ontologically encoded phenotype (as HPO terms) to a binary genotype (indicating the presence or absence of a rare variant within given genes) [1], so in this guide, we’ll adopt the same theme.
The function is an MCMC routine which accepts many arguments including the response variable y
(logical
), the ontologically encoded predictor variable x
(a list
), ones controlling the sampling pattern, ones controlling the tuning of the parameter proposal schemes and others specifying the prior distributions of the parameters.
It returns a list of traces for the sampled parameters as an object of class sim_reg_samples
(typically, this requires a lot of memory, and hence it may be preferable to store a summarised version by passing it to summary
). Of particular interest is the estimated probability of association - i.e. the posterior mean value of gamma
- the model selection indicator. This can be obtained by calling the function P(gamma=1)
on the returned object (or on the summarised object). Also, the posterior distribution of the characteristic ontological profile phi
may be of interest.
Most of the arguments have default values, so calling of the function can be demonstrated with passing only a few. To set up an environment where we can run a simple example, we need an ontology_index
object. The ontologyIndex
package contains such an instance - the Human Phenotype Ontology, hpo
- so we load the package and the data, and proceed to create an HPO profile template
and an enclosing set of terms, terms
, from which we’ll generate random term sets upon which to apply the function. In our setting, we’ll interpret this HPO profile template
as the typical phenotype of a hypothetical disease. We set template
to the set HP:0005537, HP:0000729
and HP:0001873
, corresponding to phenotype abnormalities ‘Decreased mean platelet volume’, ‘Autistic behavior’ and ‘Thrombocytopenia’ respectively.
library(ontologyIndex)
library(SimReg)
data(hpo)
set.seed(1)
template <- c("HP:0005537", "HP:0000729", "HP:0001873")
terms <- get_ancestors(hpo, c(template, sample(hpo$id, size=50)))
First, we’ll do an example where there is no association between x
and y
, and then one where there is an association.
In the example with no association, we’ll fix y
, with 10 TRUE
s and generate the x
randomly, with each set of ontological terms determined by sampling 5 random terms from terms
.
y <- c(rep(TRUE, 10), rep(FALSE, 90))
x <- replicate(simplify=FALSE, n=100, expr=minimal_set(hpo, sample(terms, size=5)))
Thus, our input data looks like:
y
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
head(x)
## [[1]]
## [1] "HP:0006660" "HP:0030157" "HP:0001697" "HP:0011792" "HP:0002814"
##
## [[2]]
## [1] "HP:0001844" "HP:0003296" "HP:0010760" "HP:0001155" "HP:0000245"
##
## [[3]]
## [1] "HP:0010244" "HP:0011277" "HP:0045037" "HP:0012372" "HP:0003468"
##
## [[4]]
## [1] "HP:0002011" "HP:0006660" "HP:0004313" "HP:0040068" "HP:0012612"
##
## [[5]]
## [1] "HP:0011912" "HP:0011492" "HP:0006711"
##
## [[6]]
## [1] "HP:0010577" "HP:0012612" "HP:0030791" "HP:0001977" "HP:0000309"
Now we can call the sim_reg
function to estimate the probability of an association (note: by default, the probability of an association has a prior of 0.05 and this can be set by passing a gamma_prior_prob
argument), and print the mean posterior value of gamma
, corresponding to our estimated probability of association.
no_assoc <- sim_reg(ontology=hpo, x=x, y=y)
`P(gamma=1)`(no_assoc)
## [1] 0.0463125
We note that there is a low probability of association. Now, we sample x
conditional on y
, so that if y[i] == TRUE
, then x
has 2 out of the 3 terms in template
added to its profile.
x_assoc <- lapply(y, function(y_i) minimal_set(hpo, c(
sample(terms, size=5), if (y_i) sample(template, size=2))))
If we look again at the first few values in x
for which y[i] == TRUE
, we notice that they contain terms from the template.
head(x_assoc)
## [[1]]
## [1] "HP:0006505" "HP:0000301" "HP:0002107" "HP:0000164" "HP:0001873"
## [6] "HP:0000729"
##
## [[2]]
## [1] "HP:0011912" "HP:0007359" "HP:0000164" "HP:0012243" "HP:0000889"
## [6] "HP:0000729" "HP:0005537"
##
## [[3]]
## [1] "HP:0011181" "HP:0003043" "HP:0004328" "HP:0010587" "HP:0030067"
## [6] "HP:0005537" "HP:0001873"
##
## [[4]]
## [1] "HP:0010228" "HP:0006737" "HP:0005368" "HP:0001873" "HP:0000729"
##
## [[5]]
## [1] "HP:0006500" "HP:0100508" "HP:0030065" "HP:0100277" "HP:0005537"
## [6] "HP:0001873"
##
## [[6]]
## [1] "HP:0100374" "HP:0011876" "HP:0030063" "HP:0001873" "HP:0000729"
Now we run the procedure again with the new x
and y
and print the mean posterior value of gamma
.
assoc <- sim_reg(ontology=hpo, x=x_assoc, y=y)
`P(gamma=1)`(assoc)
## [1] 1
We note that we infer a higher probability of association. We can also visualise the estimated characteristic ontological profile, using the function phi_plot
, and note that the inferred characteristic phenotype corresponds well to template
.
phi_plot(hpo, assoc$phi[assoc$gamma], max_terms=8, fontsize=30)
Note that we must subset the $phi
slot by $gamma
, as the characteristic ontological profile phi
has no effect if gamma == FALSE
. A more comprehensive summary of the output can be exported to pdf using the function sim_reg_summary
.
summary
can also be called print out information about the inference.
print(summary(assoc), ontology=hpo)
## ---------------------------------------------------------------------------
## P(gamma=1|y) = 1
## ---------------------------------------------------------------------------
## Phi:
## t Name P
## HP:0000708 Behavioral abnormality 0.49
## HP:0005537 Decreased mean platelet volume 0.47
## HP:0000729 Autistic behavior 0.47
## HP:0011876 Abnormal platelet volume 0.44
## HP:0011873 Abnormal platelet count 0.40
## HP:0001873 Thrombocytopenia 0.36
## HP:0000271 Abnormality of the face 0.04
## HP:0001872 Abnormality of thrombocytes 0.02
## HP:0000119 Abnormality of the genitourinary system 0.02
## HP:0009121 Abnormal axial skeleton morphology 0.02
## ---------------------------------------------------------------------------