The bioinformatic evaluation of gene co-expression often begins with correlation-based analyses. However, as a recent publication demonstrates thoroughly, correlation lacks statistical validity when applied to relative data (Lovell 2015). Such data include some of the most frequently studied biological count data, such as those produced by microarray assays or high-throughput RNA-sequencing. As an alternative to correlation, Lovell et al. propose a proportionality metric, \(\phi\), derived from compositional data (CoDa) analysis. A subsequent publication extended this work by elaborating on another proportionality metric, \(\rho\) (Erb 2016). This package introduces a programmatic framework for calculating feature dependence through proportionality, as discussed in the cited publications.
Let \(A_i\) and \(A_j\) each represent a log-ratio transformed feature vector (e.g., the transformed values of one gene, out of \(d\) genes, measured across \(n\) conditions). We then define the metrics \(\phi\) and \(\rho\) accordingly:
\[\phi(A_i, A_j) = \frac{var(A_i - A_j)}{var(A_i)}\]
\[\rho(A_i, A_j) = 1 - \frac{var(A_i - A_j)}{var(A_i) + var(A_j)}\]
Above, we use the log-ratio transformation in order to normalize the data in a manner that respects the nature of relative data. In other words, log-ratio transformation yields the same result whether applied to absolute or relative data. In this package, we consider two log-ratio transformations of the subject vector \(x\), the centered log-ratio transformation (clr) and the additive log-ratio transformation (alr). We define the transformations \(clr(x)\) and \(alr(x)\) accordingly:
\[\textrm{clr(x)} = \left[\ln\frac{x_1}{g(\textrm{x})};...;\ln\frac{x_D}{g(\textrm{x})}\right]\]
\[\textrm{alr(x)} = \left[\ln\frac{x_1}{x_D};...;\ln\frac{x_{D-1}}{x_D}\right]\]
In clr-transformation, sample vectors undergo normalization based on the logarithm of the ratio between the individual elements and the geometric mean of the vector, \(g(\textrm{x}) = \sqrt[D]{x_1...x_D}\). In alr-transformation, sample vectors undergo normalization based on the logarithm of the ratio between the individual elements and a chosen reference feature. Although these transformations differ in definition, we will sometimes refer to them jointly by the acronym *lr.
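The two transformations can be sketched in a few lines of base R. The helper names clr_vec and alr_vec are hypothetical and are not functions exported by propr; the sketch only mirrors the definitions above for a single sample vector, with the last component as the default alr reference.

```r
# Hypothetical helpers for illustration; not part of the propr API.
clr_vec <- function(x) {
  # ln(x_i / g(x)) equals ln(x_i) minus the mean of ln(x)
  log(x) - mean(log(x))
}

alr_vec <- function(x, ref = length(x)) {
  # ln(x_i / x_ref) for every component except the reference itself
  log(x[-ref]) - log(x[ref])
}

x <- c(1, 2, 4, 8)             # one sample vector with D = 4 parts
clr_x <- clr_vec(x)
alr_x <- alr_vec(x)

# Rescaling the sample (here, closing it to proportions) changes neither result
clr_rel <- clr_vec(x / sum(x))
alr_rel <- alr_vec(x / sum(x))
```

Because both transformations depend only on ratios between parts, multiplying a sample by any positive constant leaves the output unchanged, which is the sense in which they treat absolute and relative data alike.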
We provide two principal functions for calculating proportionality. The first function, phit, implements the calculation of \(\phi\) described in Lovell et al. (2015). This function makes use of clr-transformation exclusively. The second function, perb, implements the calculation of \(\rho\) described initially in Lovell et al. (2015) and expounded by Erb and Notredame (2016). This function makes use of either clr- or alr-transformation.
The first difference between \(\phi\) and \(\rho\) is scale. The values of \(\phi\) lie in \([0, \infty)\), with lower \(\phi\) values indicating more proportionality. The values of \(\rho\) lie in \([-1, 1]\), with greater \(|\rho|\) values indicating more proportionality and negative \(\rho\) values indicating inverse proportionality. A second difference is that \(\phi\) lacks symmetry. However, one can force symmetry by reflecting the lower left triangle of the matrix across the diagonal (toggled by the argument symmetrize = TRUE). A third difference is that \(\rho\) corrects for the individual variance of each feature in the pair, rather than for just one of the features as in \(\phi\).
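A small numerical sketch can make these three differences concrete. The functions phi_pair and rho_pair below are hypothetical helpers that transcribe the two formulas directly; they are not the propr implementation, and the vectors stand in for already log-ratio transformed features.

```r
# Hypothetical pairwise helpers transcribing the definitions of phi and rho.
phi_pair <- function(Ai, Aj) var(Ai - Aj) / var(Ai)
rho_pair <- function(Ai, Aj) 1 - var(Ai - Aj) / (var(Ai) + var(Aj))

set.seed(1)
Ai <- rnorm(50)       # stands in for a log-ratio transformed feature
Ashift <- Ai + 3      # constant offset on the log scale: perfectly proportional
Awide <- 2 * Ai       # same direction, but twice the spread

phi_shift <- phi_pair(Ai, Ashift)   # approximately 0 (perfect proportionality)
rho_shift <- rho_pair(Ai, Ashift)   # approximately 1

phi_ij <- phi_pair(Ai, Awide)       # var(Ai) / var(Ai)     = 1
phi_ji <- phi_pair(Awide, Ai)       # var(Ai) / (4 var(Ai)) = 0.25, so phi is asymmetric
rho_ij <- rho_pair(Ai, Awide)       # 1 - var(Ai) / (5 var(Ai)) = 0.8
rho_ji <- rho_pair(Awide, Ai)       # same value: rho is symmetric

rho_opp <- rho_pair(Ai, -Ai)        # 1 - 4 var(Ai) / (2 var(Ai)) = -1
```

Swapping the arguments changes \(\phi\) because only the first feature's variance appears in the denominator, whereas \(\rho\) divides by the sum of both variances and is therefore unaffected by the order.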
For now, we will focus on the implementations that use clr-transformation, saving a discussion of alr-transformation for later. Let us begin by building an arbitrary dataset of 4 features (e.g., genes) measured across 100 subjects. In this example dataset, the feature pair “a” and “b” will show proportional change, as will the feature pair “c” and “d”.
set.seed(12345)
N <- 100
X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))
Let \(d\) represent any number of features measured across \(n\) observations undergoing a binary or continuous event \(E\). For example, \(n\) could represent subjects differing in case-control status, treatment status, treatment dose, or time. The phit and perb functions ultimately convert a “count matrix” with \(n\) rows and \(d\) columns into a proportionality matrix of \(d\) rows and \(d\) columns containing a \(\phi\) or \(\rho\) measurement for each feature pair. One can think of this matrix as analogous to a dissimilarity matrix (in the case of \(\phi\)) or a correlation matrix (in the case of \(\rho\)). Both functions return the proportionality matrix bundled within an object of the class propr. This object contains four slots:
@counts: A matrix. Stores the original “count matrix” input.
@logratio: A matrix. Stores the log-ratio transformed “count matrix”.
@matrix: A matrix. Stores the proportionality metrics, \(\phi\) or \(\rho\).
@pairs: A vector. Indexes the proportionality metrics of interest.
library(propr)
phi <- phit(X, symmetrize = TRUE)
rho <- perb(X, ivar = 0)
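To illustrate how an \(n\) by \(d\) count matrix becomes a \(d\) by \(d\) proportionality matrix, here is a minimal base R sketch that does not call propr. It rebuilds the example data, applies a clr-transformation to each subject, and fills the matrices from the definitions of \(\rho\) and \(\phi\). The orientation chosen for \(\phi\) (row feature in the denominator) is an assumption made for the sketch, and the package's compiled routines may differ in detail.

```r
set.seed(12345)
N <- 100
X <- data.frame(a = (1:N), b = (1:N) * rnorm(N, 10, 0.1),
                c = (N:1), d = (N:1) * rnorm(N, 10, 1.0))

# clr-transform each subject (row): subtract the row mean of the logs
logX <- log(as.matrix(X))
lr <- logX - rowMeans(logX)

# rho, using var(Ai - Aj) = var(Ai) + var(Aj) - 2 cov(Ai, Aj),
# which gives rho = 2 cov(Ai, Aj) / (var(Ai) + var(Aj))
v <- apply(lr, 2, var)
rho_mat <- 2 * cov(lr) / outer(v, v, "+")

# phi, with the row feature in the denominator (assumed orientation)
d <- ncol(lr)
phi_mat <- matrix(0, d, d, dimnames = list(colnames(lr), colnames(lr)))
for (i in seq_len(d)) for (j in seq_len(d))
  phi_mat[i, j] <- var(lr[, i] - lr[, j]) / var(lr[, i])

# Force symmetry by reflecting the lower-left triangle across the diagonal
phi_sym <- phi_mat
phi_sym[upper.tri(phi_sym)] <- t(phi_sym)[upper.tri(phi_sym)]
```

The vectorized form of \(\rho\) avoids an explicit double loop; the loop for \(\phi\) is kept for readability, since a 4 by 4 matrix does not need optimizing.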
We have provided methods for indexing and subsetting objects belonging to the propr class. Using the familiar [ method, we can efficiently index the proportionality matrix (@matrix) based on an inequality operator and a reference value. By design, this method never modifies the proportionality matrix, making it scale well with large datasets.
In this first example, we use [ to index the matrix by \(\rho > .99\). This indexes the location of all values (i.e., in the lower left triangle of the matrix) satisfying that inequality, and saves those indices to the @pairs slot. Indexing helps guide bundled visualization methods in lieu of copy-on-modify subsetting.
rho99 <- rho[">", .99]
rho99@pairs
## [1] 2 12
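The two indices printed above are consistent with column-major positions in the lower-left triangle of a 4 by 4 matrix. As a rough base R sketch of that idea (the matrix m is made up for illustration, and this is not necessarily how the [ method is implemented inside propr):

```r
# A made-up symmetric 4 x 4 matrix standing in for rho@matrix
m <- matrix(c(1,    .995, .1,   .2,
              .995, 1,    .3,   .4,
              .1,   .3,   1,    .999,
              .2,   .4,   .999, 1), nrow = 4, ncol = 4)

# Linear (column-major) indices of lower-triangle values above the cutoff
idx <- which(m > .99 & lower.tri(m))
idx
# 2 12

# Convert the linear indices back into (row, column) coordinates
coords <- arrayInd(idx, dim(m))
```

Storing a short integer vector of positions, rather than a filtered copy of the matrix, is what lets this style of indexing avoid duplicating a large \(d\) by \(d\) object.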
Alternatively, using the subset method, we can subset an entire propr object by a vector of feature indices or names. The subset method also provides a convenient way to re-order feature and subject vectors for downstream visualization tools (e.g., image). However, this method does copy-on-modify the proportionality matrix, making it unsuitable for large datasets.
In this second example, we subset by the feature names “a” and “b”.
rhoab <- subset(rho, select = c("a", "b"))
rhoab@matrix
## [,1] [,2]
## [1,] 1.0000000 0.9999151
## [2,] 0.9999151 1.0000000
The convenience function, simplify, can subset an entire propr object based on the index saved to its @pairs slot. This function converts the saved index into a paired list of coordinates and passes them along to the subset method. As such, this method does copy-on-modify the proportionality matrix, making it unsuitable for large datasets. Unlike subset, simplify returns an object with the @pairs slot updated.
simplify(rho99)
## @counts summary: 100 subjects by 4 features
## @logratio summary: 100 subjects by 4 features
## @matrix summary: 4 features by 4 features
## @pairs summary: 2 feature pairs
The two features in a highly proportional pair should show approximately linearly correlated *lr-transformed expression across all subjects. The method plot provides a means by which to visually inspect whether this holds true. Since this function will plot all pairs unless indexed with the [ method, we recommend the user first index or subset the propr object before plotting. “Noisy” correlation between some feature pairs could suggest that the proportionality cutoff is too lenient. We include this plot as a handy “sanity check” when working with high-dimensional datasets.
plot(rho99)
Both microarray technology and high-throughput genomic sequencing have the ability to measure tens of thousands of features for each subject. Since calculating proportionality generates a matrix sized \(d^2\), this method uses a lot of RAM when applied to real biological datasets. To address this issue, the newest version of propr harnesses the power of C++ (via the Rcpp package) to achieve a near 100-fold increase in computational speed and an 80% reduction in RAM overhead. Below, we provide a small table that estimates the approximate amount of RAM needed to render a proportionality matrix based on the number of features studied. The user should account for up to 25% more MiB in additional RAM for subsequent [ indexing and visualization.
Features | Peak RAM (MiB) |
---|---|
1000 | 8 |
2000 | 31 |
4000 | 123 |
8000 | 491 |
16000 | 1959 |
24000 | 4405 |
32000 | 7829 |
64000 | 31301 |
100000 | 76406 |
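For a feature count not listed in the table, a back-of-the-envelope lower bound can be computed by assuming the result is a dense \(d\) by \(d\) matrix of 8-byte doubles. The helper est_mib is hypothetical and counts only the final matrix, so the tabulated peaks sit at or slightly above what it returns.

```r
# Hypothetical helper: MiB occupied by a dense d x d matrix of doubles
est_mib <- function(d, bytes_per_value = 8) {
  d^2 * bytes_per_value / 1024^2
}

features <- c(1000, 2000, 4000, 8000, 16000)
estimate <- data.frame(features, est_MiB = round(est_mib(features)))
estimate
```

Doubling the number of features quadruples the estimate, which is why the requirement climbs from a few MiB to tens of GiB across the range shown in the table.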
We recognize that this package builds off concepts that are not necessarily intuitive. Since the log-ratio transformation of relative data comprises a major portion of proportionality analysis, we decided to dedicate some extra space to this topic specifically. In this section, we discuss the centered log-ratio (clr) and its limitations in the context of proportionality analysis. To this end, we begin by simulating count data for 5 features (e.g., genes) labeled “a”, “b”, “c”, “d”, and “e”, as measured across 100 subjects.
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
b <- a * rnorm(N, mean = 1, sd = 0.1)
c <- rnorm(N, mean = 10)
d <- rnorm(N, mean = 10)
e <- rep(10, N)
X <- data.frame(a, b, c, d, e)
Let us assume that these data \(X\) represent absolute abundance counts (i.e., not relative data). We can build a relative dataset, \(Y\), by distorting \(X\) accordingly:
Y <- X / rowSums(X) * abs(rnorm(N))
As a “sanity check”, we will confirm that these new feature vectors do in fact contain relative quantities. We do this by calculating the ratio of the second feature vector to the first for both the absolute and relative datasets.
all(round(X[, 2] / X[, 1] - Y[, 2] / Y[, 1], 5) == 0)
## [1] TRUE
The following figures compare pairwise scatterplots for the absolute count data and the corresponding relative count data. We see quickly how these relative data suggest a spurious correlation: although genes “c” and “d” do not correlate with one another absolutely, their relative quantities do.
pairs(X)
pairs(Y)
Next, we will see that when we do calculate correlation, the coefficients differ for the absolute and relative datasets. This further demonstrates the spurious correlation.
cor(X)
## a b c d e
## a 1.00000000 0.9495487 -0.08429201 -0.1284406 NA
## b 0.94954870 1.0000000 -0.17278967 -0.1183455 NA
## c -0.08429201 -0.1727897 1.00000000 -0.1271698 NA
## d -0.12844062 -0.1183455 -0.12716985 1.0000000 NA
## e NA NA NA NA 1
cor(Y)
## a b c d e
## a 1.0000000 0.9918545 0.8606885 0.8700002 0.8630598
## b 0.9918545 1.0000000 0.8553602 0.8677473 0.8622694
## c 0.8606885 0.8553602 1.0000000 0.9857120 0.9923988
## d 0.8700002 0.8677473 0.9857120 1.0000000 0.9909547
## e 0.8630598 0.8622694 0.9923988 0.9909547 1.0000000
However, by calculating the variance of the log-ratios (vlr), defined as the variance of the logarithm of the ratio of two feature vectors, we can arrive at a single measure of dependence that (a) does not change with respect to the nature of the data (i.e., absolute or relative), and (b) does not change with respect to the number of features included in the computation. As such, the vlr, constituting the numerator portion of the \(\phi\) metric and a portion of the \(\rho\) metric as well, is sub-compositionally coherent. Yet, while vlr yields valid results for compositional data, it lacks a meaningful scale.
propr:::proprVLR(Y[, 1:4])
## a b c d
## a 0.000000000 0.009007394 0.11273963 0.11192702
## b 0.009007394 0.000000000 0.12431341 0.11769259
## c 0.112739635 0.124313413 0.00000000 0.01986009
## d 0.111927021 0.117692593 0.01986009 0.00000000
propr:::proprVLR(X)
## a b c d e
## a 0.000000000 0.009007394 0.112739635 0.111927021 0.097960496
## b 0.009007394 0.000000000 0.124313413 0.117692593 0.104219359
## c 0.112739635 0.124313413 0.000000000 0.019860086 0.009516737
## d 0.111927021 0.117692593 0.019860086 0.000000000 0.008167461
## e 0.097960496 0.104219359 0.009516737 0.008167461 0.000000000
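The two properties of the vlr can also be checked without the package internals. The sketch below defines a hypothetical vlr function in base R (it is not propr:::proprVLR), regenerates data of the same shape with an arbitrary seed, and confirms that the result is unchanged by per-subject rescaling and by dropping a feature. The rescaling factor is drawn with runif rather than abs(rnorm(N)) purely to keep it away from zero.

```r
set.seed(1)  # arbitrary seed, so the numbers differ from those printed above
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
X <- data.frame(a, b = a * rnorm(N, mean = 1, sd = 0.1),
                c = rnorm(N, mean = 10), d = rnorm(N, mean = 10),
                e = rep(10, N))
Y <- X / rowSums(X) * runif(N, 0.5, 2)   # relative data with arbitrary totals

# Hypothetical helper: variance of the log-ratio for every feature pair
vlr <- function(counts) {
  lg <- log(as.matrix(counts))
  d <- ncol(lg)
  out <- matrix(0, d, d, dimnames = list(colnames(lg), colnames(lg)))
  for (i in seq_len(d)) for (j in seq_len(d))
    out[i, j] <- var(lg[, i] - lg[, j])
  out
}

# (a) absolute and relative data give the same vlr
same_abs_rel <- all.equal(vlr(X), vlr(Y))
# (b) removing feature "e" leaves the remaining entries untouched
same_subcomp <- all.equal(vlr(Y[, 1:4]), vlr(Y)[1:4, 1:4])
```

Each entry depends only on the ratio of two features within a subject, so neither the subject totals nor the presence of other features can influence it.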
Similarly, transformation of a counts matrix by clr also makes the data sub-compositionally coherent. In the calculation of proportionality coefficients, we use the variance about the clr-transformed data to normalize the variance of the log-ratios (vlr). In other words, we adjust the arbitrarily defined vlr by the variance of its individual constituents. In this way, the use of clr-transformed data shifts the vlr-matrix onto a “standardized” scale that compares across all feature pairs.
In the next figures, we compare pairwise scatterplots for the clr-transformed absolute count data and the corresponding clr-transformed relative count data. While the two sets of plots are equivalent, we see a relationship between “c” and “d” that should not exist based on what we know from the non-transformed absolute count data. This relationship is ultimately reflected (at least partially) in the results of phit and perb alike.
pairs(propr:::proprCLR(Y[, 1:4]))
pairs(propr:::proprCLR(X))
However, division of the vlr by the variance of the clr lacks sub-compositional coherence. As such, neither \(\phi\) nor \(\rho\), at least when calculated via clr, yields the same result for absolute and relative data. This may explain why these methods do not, per se, prevent the possible discovery of spurious proportionality.
phit(Y[, 1:4])@matrix
## Calculating phi from "count matrix".
## [,1] [,2] [,3] [,4]
## [1,] 0.000000 0.328171 4.1075015 4.0778951
## [2,] 0.328171 0.000000 3.9114296 3.7031104
## [3,] 4.107501 3.911430 0.0000000 0.5971697
## [4,] 4.077895 3.703110 0.5971697 0.0000000
phit(X)@matrix
## Calculating phi from "count matrix".
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.0000000 0.2388549 2.9895895 2.9680409 2.5976815
## [2,] 0.2388549 0.0000000 2.9298206 2.7737810 2.4562436
## [3,] 2.9895895 2.9298206 0.0000000 0.8050362 0.3857646
## [4,] 2.9680409 2.7737810 0.8050362 0.0000000 0.3564512
## [5,] 2.5976815 2.4562436 0.3857646 0.3564512 0.0000000
perb(Y[, 1:4])@matrix
## Calculating rho from "count matrix".
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 0.8479235 -0.8571942 -0.9020354
## [2,] 0.8479235 1.0000000 -0.9113638 -0.8627917
## [3,] -0.8571942 -0.9113638 1.0000000 0.6928331
## [4,] -0.9020354 -0.8627917 0.6928331 1.0000000
perb(X)@matrix
## Calculating rho from "count matrix".
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0000000 0.8876058 -0.8072883 -0.8462492 -0.8459643
## [2,] 0.8876058 1.0000000 -0.8526537 -0.8011329 -0.8035079
## [3,] -0.8072883 -0.8526537 1.0000000 0.5826229 0.7622388
## [4,] -0.8462492 -0.8011329 0.5826229 1.0000000 0.7865827
## [5,] -0.8459643 -0.8035079 0.7622388 0.7865827 1.0000000
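The same behaviour can be reproduced in base R. The sketch below uses a hypothetical rho_clr helper (not the package's perb) on freshly simulated data of the same shape: rescaling each subject leaves the clr-based \(\rho\) untouched, but dropping feature “e” changes the geometric mean and therefore changes \(\rho\) for the remaining pairs.

```r
set.seed(1)  # arbitrary seed, so the numbers differ from those printed above
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
X <- data.frame(a, b = a * rnorm(N, mean = 1, sd = 0.1),
                c = rnorm(N, mean = 10), d = rnorm(N, mean = 10),
                e = rep(10, N))
Y <- X / rowSums(X) * runif(N, 0.5, 2)

# Hypothetical helper: rho computed on clr-transformed data
rho_clr <- function(counts) {
  lg <- log(as.matrix(counts))
  lr <- lg - rowMeans(lg)            # clr per subject (row)
  v <- apply(lr, 2, var)
  2 * cov(lr) / outer(v, v, "+")     # equals 1 - var(Ai - Aj) / (var(Ai) + var(Aj))
}

# Same features, different totals: identical results
same_scaling <- all.equal(rho_clr(X), rho_clr(Y))

# Same pairs, one feature removed: results differ
rho_full <- rho_clr(Y)[1:4, 1:4]
rho_sub <- rho_clr(Y[, 1:4])
max_shift <- max(abs(rho_full - rho_sub))
```

The shift arises entirely from the denominator: the vlr in the numerator is coherent, but the clr variances are computed relative to a geometric mean that depends on which features are present.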
Still, in comparing the dependence between “c” and “d” as calculated by \(cor(Y)\) with that of \(\rho(Y)\), it appears that proportionality analysis does offer at least partial protection against spurious results.
cor(Y)
## a b c d e
## a 1.0000000 0.9918545 0.8606885 0.8700002 0.8630598
## b 0.9918545 1.0000000 0.8553602 0.8677473 0.8622694
## c 0.8606885 0.8553602 1.0000000 0.9857120 0.9923988
## d 0.8700002 0.8677473 0.9857120 1.0000000 0.9909547
## e 0.8630598 0.8622694 0.9923988 0.9909547 1.0000000
perb(Y)@matrix
## Calculating rho from "count matrix".
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0000000 0.8876058 -0.8072883 -0.8462492 -0.8459643
## [2,] 0.8876058 1.0000000 -0.8526537 -0.8011329 -0.8035079
## [3,] -0.8072883 -0.8526537 1.0000000 0.5826229 0.7622388
## [4,] -0.8462492 -0.8011329 0.5826229 1.0000000 0.7865827
## [5,] -0.8459643 -0.8035079 0.7622388 0.7865827 1.0000000
Finally, the reader should note that in this contrived example, \(\phi(X) = \phi(Y)\) and \(\rho(X) = \rho(Y)\), but only because the sum of the feature parts in the relative dataset can explain the whole of the absolute dataset. In other words, this comes from the fact that in crafting the relative dataset, we used information spanning the entire absolute dataset (i.e., rowSums). This is usually not the case when studying biological count data and, on its own, does not imply sub-compositional coherence.
Unlike the centered log-ratio (clr), which adjusts each subject vector by the geometric mean of that vector, the additive log-ratio (alr) adjusts each subject vector by the value of one of its own components, chosen as a reference. If we select as a reference some feature \(D\) with an a priori known fixed absolute count across all subjects, we can effectively “back-calculate” absolute data from relative data. When initially crafting the data \(X\), we included “e” as this fixed value.
The following figures compare pairwise scatterplots for alr-transformed relative count data (i.e., \(alr(Y)\) with “e” as the reference) and the corresponding absolute count data. We see here how alr-transformation eliminates the spurious correlation between “c” and “d”.
pairs(propr:::proprALR(Y, ivar = 5))
pairs(X[, 1:4])
Again, this gets reflected in the results of perb when we select “e” as the reference.
perb(Y, ivar = 5)@matrix
## Calculating rho from "count matrix".
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000000 0.95544861 -0.04896295 -0.05464219 0
## [2,] 0.95544861 1.00000000 -0.09299877 -0.04720992 0
## [3,] -0.04896295 -0.09299877 1.00000000 -0.12304138 0
## [4,] -0.05464219 -0.04720992 -0.12304138 1.00000000 0
## [5,] 0.00000000 0.00000000 0.00000000 0.00000000 1
Now, let us assume these same data, \(X\), actually measure relative counts. In other words, \(X\) is already relative and we do not know the real quantities which correspond to \(X\) absolutely. If we knew that “a” represented a known fixed quantity, we could use alr-transformation again to “back-calculate” the absolute abundances. In this case, we will see that “c”, “d”, and “e” actually do have proportional expression under these conditions. Although the measured quantities of “c”, “d”, and “e” do not change considerably across subjects, the measured quantity of the known fixed feature does change. As such, whenever “a” increases while “c”, “d”, and “e” remain the same, the latter three features have actually decreased. Since they all decreased together, they act as a highly proportional module.
pairs(propr:::proprALR(X, ivar = 1))
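The “back-calculation” described above amounts to dividing every feature by the reference and multiplying by the reference's true abundance. The sketch below works through it in base R on freshly simulated data; the value a_known is a hypothetical fixed abundance chosen only for illustration.

```r
set.seed(1)  # arbitrary seed
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
X <- data.frame(a, b = a * rnorm(N, mean = 1, sd = 0.1),
                c = rnorm(N, mean = 10), d = rnorm(N, mean = 10),
                e = rep(10, N))

a_known <- 10                       # hypothetical true abundance of "a"
abs_est <- X / X$a * a_known        # rescale each subject so that "a" is fixed

# After rescaling, "c", "d", and "e" all fall as the measured "a" rises,
# so their log abundances move together
module_cor <- cor(log(abs_est[, c("c", "d", "e")]))
```

On the log scale, this rescaling differs from the alr-transformation with “a” as the reference only by the constant log(a_known), so the variances and covariances that enter \(\rho\) are the same in both cases.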
Again, this gets reflected in the results of perb when we select “a” as the reference.
perb(X, ivar = 1)@matrix
## Calculating rho from "count matrix".
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0.00000000 0.00000000 0.00000000 0.00000000
## [2,] 0 1.00000000 -0.02107964 0.02680645 0.02569491
## [3,] 0 -0.02107964 1.00000000 0.91160199 0.95483279
## [4,] 0 0.02680645 0.91160199 1.00000000 0.96108648
## [5,] 0 0.02569491 0.95483279 0.96108648 1.00000000
We can visualize this module using the bundled visualization method dendrogram.
dendrogram(perb(X, ivar = 1))
## Calculating rho from "count matrix".
## Alert: Generating plot using all feature pairs.
##
## Call:
## fastcluster::hclust(d = dist)
##
## Cluster method : complete
## Number of objects: 5
Returning to our initial assumption that the matrix \(X\) contains absolute count data while the matrix \(Y\) contains relative count data, we can show that \(\rho\) calculated with alr-transformation not only corrects for spurious proportionality, but also serves as a sub-compositionally coherent measure of dependence. Moreover, unlike the aforementioned vlr, \(\rho\) has a meaningful scale. In the example below, we calculate \(\rho\) using the alr-transformation about the reference “e” for four compositions of the relative count matrix, \(Y\), as well as for the absolute count matrix, \(X\). We see here that, unlike clr-transformed proportionality metrics, the alr-transformed metric \(\rho\) yields identical results regardless of the nature of the data explored. Of course, this assumes that one knows the identity of a feature fixed across all subjects. Still, at this point, one might also consider “back-calculating” the absolute abundances and measuring dependence through more conventional means.
perb(Y[, 2:5], ivar = 4)@matrix
## Calculating rho from "count matrix".
## [,1] [,2] [,3] [,4]
## [1,] 1.00000000 -0.09299877 -0.04720992 0
## [2,] -0.09299877 1.00000000 -0.12304138 0
## [3,] -0.04720992 -0.12304138 1.00000000 0
## [4,] 0.00000000 0.00000000 0.00000000 1
perb(X, ivar = 5)@matrix
## Calculating rho from "count matrix".
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000000 0.95544861 -0.04896295 -0.05464219 0
## [2,] 0.95544861 1.00000000 -0.09299877 -0.04720992 0
## [3,] -0.04896295 -0.09299877 1.00000000 -0.12304138 0
## [4,] -0.05464219 -0.04720992 -0.12304138 1.00000000 0
## [5,] 0.00000000 0.00000000 0.00000000 0.00000000 1
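The coherence shown in the output above follows from the definition, and a base R sketch can confirm it independently of the package. The helper rho_alr is hypothetical: unlike perb, it simply drops the reference feature from the result instead of reporting zeros for it.

```r
set.seed(1)  # arbitrary seed, so the numbers differ from those printed above
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
X <- data.frame(a, b = a * rnorm(N, mean = 1, sd = 0.1),
                c = rnorm(N, mean = 10), d = rnorm(N, mean = 10),
                e = rep(10, N))
Y <- X / rowSums(X) * runif(N, 0.5, 2)

# Hypothetical helper: rho on alr-transformed data, reference column dropped
rho_alr <- function(counts, ivar) {
  lg <- log(as.matrix(counts))
  lr <- lg[, -ivar, drop = FALSE] - lg[, ivar]   # ln(x_i / x_ref) per subject
  v <- apply(lr, 2, var)
  2 * cov(lr) / outer(v, v, "+")
}

rho_abs <- rho_alr(X, ivar = 5)        # absolute data, reference "e"
rho_rel <- rho_alr(Y, ivar = 5)        # relative data, reference "e"
rho_sub <- rho_alr(Y[, 2:5], ivar = 4) # sub-composition without "a"

same_abs_rel <- all.equal(rho_abs, rho_rel)
same_subcomp <- all.equal(rho_abs[2:4, 2:4], rho_sub)
```

Every quantity in the alr-based \(\rho\) is built from ratios to the reference within a subject, so neither rescaling the subject nor removing an unrelated feature can alter the result.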
Although we developed this package with biological count data in mind, many of the ostensibly compositional biological datasets do not behave in a truly compositional manner. For example, in the setting of gene expression data, measuring the expression of “Gene A” as 1 in one subject and the expression of “Gene B” as 2 in another subject (i.e., the feature vector \([1, 2]\)), does not carry the same information as measuring the expression of “Gene A” as 1000 in one subject and the expression of “Gene B” as 2000 in another subject (i.e., the feature vector \([1000, 2000]\)). As such, these data do not strictly meet the criteria for compositional data. Unfortunately, we do not yet have a model to adequately address this drawback. Therefore, we advise the investigator to proceed with caution when working with such “count compositional” data.
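A two-line check shows why purely compositional tools cannot see this difference. In the sketch below, the helper names closure and clr_vec are hypothetical, and the two vectors are the ones quoted in the paragraph above.

```r
x_small <- c(A = 1, B = 2)
x_large <- c(A = 1000, B = 2000)

# Hypothetical helpers: closure to proportions, and the centered log-ratio
closure <- function(x) x / sum(x)
clr_vec <- function(x) log(x) - mean(log(x))

same_props <- all.equal(closure(x_small), closure(x_large))
same_clr <- all.equal(clr_vec(x_small), clr_vec(x_large))
```

Both comparisons come out equal, so any information carried by the magnitude of the counts is discarded before proportionality is ever calculated.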
Erb, Ionas, and Cedric Notredame. “How Should We Measure Proportionality on Relative Gene Expression Data?” Theory in Biosciences = Theorie in Den Biowissenschaften 135, no. 1–2 (June 2016): 21–36.
Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. “Proportionality: A Valid Alternative to Correlation for Relative Data.” PLoS Computational Biology 11, no. 3 (March 2015): e1004075.